controlled vocabulary for materialSampleType #24

Closed
5 tasks
baskaufs opened this issue Apr 19, 2022 · 44 comments

@baskaufs

As requested in the 2022-03-16 meeting, I have created a draft controlled vocabulary for the proposed materialSampleType term based on the existing specimen types. It can be viewed as a list of terms document and in tabular form.

I believe the decision was to start with the existing specimen types, with the option of adding other values if we could agree upon what they should be. The vocabulary is easy to expand by just adding more rows to the source CSV (linked above).

Additional things to be resolved:

  • Is PreservedSpecimen actually a broader concept of FossilSpecimen? The definition suggests that, so I put it in the metadata, but I'm not sure if that's right. It is allowed, but not required, to have hierarchical relationships among SKOS concepts and there is precedent within TDWG controlled vocabularies for doing that if desired.
  • For the value of the property that shows where the definition was derived from, I used the particular version of the Darwin Core class from which I copied the definition and examples. For example: http://rs.tdwg.org/dwc/terms/version/LivingSpecimen-2018-09-06 . I think this is the right approach, even if those class terms eventually become deprecated. I considered using the BCO IRIs, but they are now marked as obsolete and the metadata wasn't exactly the same as what's currently in DwC.
  • For the controlled value strings, I used UpperCamelCase, e.g. FossilSpecimen, rather than lowerCamelCase (which would be fossilSpecimen) as has become somewhat standard for other controlled vocabularies. However, since UpperCamelCase is to some extent already in use or assumed for these terms, I thought it better to stick with that.
  • There is an opportunity to provide normative usage guidelines (not common for TDWG-minted terms) and non-normative notes (an option if we need them), but I didn't supply any. We can add them if desired.
  • For the preferred namespace abbreviation, I used dwcmatter. I think it's best to include the "dwc" part in case the namespace gets used beyond TDWG. I used "matter" instead of something longer because places that track namespace abbreviations, such as Linked Open Vocabularies (LOV), won't accept abbreviations that are too long (e.g. tdwgutility is too long). I'm not sure what the character limit is. But this seemed to encapsulate what the CV is talking about (things that are matter as opposed to information), and the abbreviation doesn't really have any normative meaning -- it's just for convenience. Also, most people will use the controlled value strings rather than IRIs anyway. But we could make it something else if desired.
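
To make the first and third points above concrete, here is a minimal sketch of how rows of a source CSV could be loaded and checked. The column names (`controlled_value`, `broader`) and the file layout are assumptions made for illustration and are not taken from the actual source CSV; the rows only mirror the PreservedSpecimen/FossilSpecimen question raised above.

```python
import csv
import io
import re

# Hypothetical CSV layout; the real source file may use different columns.
SOURCE_CSV = """controlled_value,broader
PreservedSpecimen,
FossilSpecimen,PreservedSpecimen
LivingSpecimen,
"""

# One or more capitalized word chunks, e.g. "FossilSpecimen".
UPPER_CAMEL = re.compile(r"^(?:[A-Z][a-z0-9]+)+$")


def load_vocabulary(text):
    """Map each controlled value string to its broader value (or None)."""
    rows = csv.DictReader(io.StringIO(text))
    return {row["controlled_value"]: (row["broader"] or None) for row in rows}


def check_vocabulary(vocab):
    """List problems: values not in UpperCamelCase or unknown broader values."""
    problems = []
    for value, broader in vocab.items():
        if not UPPER_CAMEL.match(value):
            problems.append(f"{value}: not UpperCamelCase")
        if broader is not None and broader not in vocab:
            problems.append(f"{value}: unknown broader value {broader}")
    return problems


vocab = load_vocabulary(SOURCE_CSV)
problems = check_vocabulary(vocab)
```

Adding a value would then amount to appending one more row, and the check would flag a row such as fossilSpecimen that departs from the casing convention.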
@baskaufs
Author

reference #14

@baskaufs
Author

Ping @Jegelewicz

@Jegelewicz
Collaborator

@baskaufs THANK YOU! We can take a deep dive tomorrow?

@RogerBurkhalter

@baskaufs I will comment on the definition of FossilSpecimen as "A preserved specimen that is a fossil". I advocate removing the term "preserved". First, the process of fossilization is preservation in and of itself. It is also a "natural" process, not an anthropogenic process such as those described under most of dwc:preparations. (That term also uses the word "preserved", and many paleo workers take issue with fossils being included in it as well, because for fossils preparation has nothing to do with how the item is stored.) The concept of SkinSpecimen or SkeletalSpecimen representing Skin or Skeletal preservation/preparation does not make sense.

There may be fringe instances where a recently living entity is naturally preserved, such as a desiccated mammal or bird, or a frozen mammoth body that is not "turned to stone" in the way usually envisioned for fossils. Perhaps these fringe instances can be included in PreservedSpecimen with a preparation type such as naturally desiccated or naturally frozen. They do not represent fossils (although some would argue the mammoth is a fossil).

Speaking of fossils, the examples include items that are not fossils but are instead examples of behavior. These are coprolites, gastroliths, and ichnofossils. These are all forms of ichnofossils and as such cannot (usually) be directly attributed to any particular species, and should perhaps be a form of specimens of a "preserved observation"?

@cboelling
Member

I would like to understand why, in order to specify values for dwc:MaterialSampleType a new concept scheme with newly defined resources (a.k.a terms) is preferred (including concept scheme infrastructure like name spaces, IRIs). At first glance it seems that what is informative about the newly minted resources can also be expressed with the existing terms, e.g. using http://rs.tdwg.org/dwc/terms/version/LivingSpecimen-2018-09-06 or its associated label ("Living Specimen") or adaptations thereof. Couldn't those be used as values?

@Jegelewicz
Collaborator

Perhaps add a requested term?

environmentalSample

@Jegelewicz
Collaborator

Also, GGBN will be concerned looking for "tissue".....

@Jegelewicz
Collaborator

I will comment on the definition of FossilSpecimen as "A preserved specimen that is a fossil". I advocate removing the term "preserved".

I agree with removing preserved from this definition.

However, won't we have some FossilSpecimens that are also PreservedSpecimen? Like these fossil scutes prepared as thin sections? https://arctos.database.museum/guid/NMMNH:Paleo:16545

I guess this also means there will need to be a whole other term for the description of the material "scute"? Should we be looking at that here or passing that down to the next task group?

@Jegelewicz
Collaborator

the examples include items that are not fossils but are instead examples of behavior. These are coprolites, gastroliths, and ichnofossils. These are all forms of ichnofossils and as such cannot (usually) be directly attributed to any particular species, and should perhaps be a form of specimens of a "preserved observation"?

I suggest we probably need a type for this in controlled vocabulary, but I also suggest "trace" rather than "PreservedObservation" which seems like it could also be used for a photograph. Trace could also cover things like scat, molds of footprints and such.

@deepreef

However, won't we have some FossilSpecimens that are also PreservedSpecimen

I think there's a subtle distinction between "preserved" and "prepared" (or "curated"). When you think about it, every physical object is in some way preserved. There are varying degrees of the duration of preservation. Fossils, through mineralization, are preserved for millions of years. Specimens treated with formaldehyde and/or alcohol are preserved for (potentially) centuries. Tissue samples stored in DMSO are preserved for decades(?). A fresh carcass in an air-conditioned room is preserved for days or perhaps weeks.

I guess my point is that the word "preserved" is a bit meaningless and/or implied by the word "specimen". What I think people are actually interested in is the current state and history of "preparations". That is, what sorts of actions have been performed on a physical thing? Some of these actions are intended to extend the duration of preservation (e.g., formalin, alcohol, etc.). Some of them are intended to allow examination or analysis (e.g., thin sections, tissue extractions for DNA, etc.)

I get that we want to distinguish "Preserved" Specimens from "Living" Specimens, but it would seem to me that the alternative of "Living" is actually "Dead", not "Preserved". Yeah, I know what we mean by "Preserved" specimen; but given that we're taking the time to completely restructure how these terms are applied, perhaps now is a good time to rethink how we parse out the different kinds of MaterialSample instances?

One option is to do as we have been doing, which is sort of "overload" a basic term like materialSampleType to capture clues about preservation method, living vs. dead, mineralized vs. actual biological material, whole organism vs. part of organism vs. aggregate of multiple organisms, vs. which part of an organism it is, etc. I worry that trying to capture all these disparate properties in some controlled vocabulary of terms squeezed into materialSampleType might make things more complicated, rather than more simple.

Maybe a good topic of discussion for today's chat would be "What parameters are we trying to represent in the values of materialSampleType?"

@Jegelewicz
Collaborator

Maybe a good topic of discussion for today's chat would be "What parameters are we trying to represent in the values of materialSampleType?"

Added to the agenda!

@smrgeoinfo

see baskaufs/msc#1 (comment)

@smrgeoinfo

The question in #24 (comment), 'What parameters are we trying to represent in the values of materialSampleType?', is important. For a controlled vocabulary, it is very useful to have a clear definition of the use case for the vocabulary; its scope (biological samples, any material sample, Earth Materials....); the criteria for differentiating the terms; whether the terms are hierarchical; whether the terms cover the scope (covering); and whether terms can have overlapping meaning (unique, unambiguous).

@albenson-usgs

albenson-usgs commented Apr 20, 2022

Perhaps add a requested term?
environmentalSample

I think this is too broad. I would like to use the examples from tdwg/dwc#40, e.g. envo:soil, envo:sediment, envo:saline water.

I think being able to distinguish a soil sample vs. a saline water sample vs. a freshwater sample will be important to eDNA data providers.

@Jegelewicz
Collaborator

Jegelewicz commented Apr 21, 2022

'What parameters are we trying to represent in the values of materialSampleType?' is important. For a controlled vocabulary, it is very useful to have a clear definition of the use case for the vocabulary; its scope (biological samples, any material sample, Earth Materials....); the criteria for differentiating the terms; whether the terms are hierarchical; whether the terms cover the scope (covering); and whether terms can have overlapping meaning (unique, unambiguous).

In the second meeting yesterday, we discussed this. Those present could see the need for thinking beyond the currently used "GBIF basisOfRecord" terms, and @albenson-usgs suggested that we take a step back and start by creating a list of terms we think we might find or want to place in materialSampleType. So, I have started a Google Sheet and I would like everyone to think about what they might place in this vocabulary. Just add your terms to the bottom of "suggested vocabulary". We can then deduplicate the list and start categorizing to see if we can build a broader and more useful vocabulary. In addition, I think it would be helpful for each of us to think about the quote above. What do we expect from the vocabulary for this term?

@baskaufs
Author

Refer to existing draft controlled vocabulary for organism parts here and organized by organism group here. The terms are intended to be used as values for ac:subjectPart, which indicates the part of the organism being photographed, but it could generally refer to organism parts in other contexts.

@Jegelewicz
Collaborator

Refer to existing draft controlled vocabulary for organism parts here and organized by organism group here.

Added to Google Sheet

@baskaufs
Author

@cboelling To respond to your question

I would like to understand why, in order to specify values for dwc:MaterialSampleType a new concept scheme with newly defined resources (a.k.a terms) is preferred (including concept scheme infrastructure like name spaces, IRIs). At first glance it seems that what is informative about the newly minted resources can also be expressed with the existing terms, e.g. using http://rs.tdwg.org/dwc/terms/version/LivingSpecimen-2018-09-06 or its associated label ("Living Specimen") or adaptations thereof. Couldn't those be used as values?

The controlled vocabulary as I generated it follows the conventions that have been established within TDWG for ratified controlled vocabularies. One of the goals of that system is to eliminate longstanding confusion between term labels, IRI local names, and the controlled value strings that people should use in spreadsheets or tables. These three things have been badly conflated in the past. That's a problem because TDWG is an international organization and labels are (or should be) available in many languages, whereas there should be a single controlled value string used by everyone as a value for the property. You can see examples under the three existing controlled vocabularies within Darwin Core (for establishmentMeans, pathway, and degreeOfEstablishment), available from the top navigation bar on the Darwin Core website. The intent is for this vocabulary to follow the same pattern. These controlled vocabularies now have some label translations available at https://tdwg.github.io/rs.tdwg.org/ .

The IRI local names are intentionally opaque so that no one is tempted to try to use them as controlled value strings. But since there are IRIs and JSON-LD using them, one can encode SKOS relationships among concepts (such as skos:broader) in a machine-readable way. See https://tdwg.github.io/rs.tdwg.org/cvJson/pathway.json for example.
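
As a rough illustration of the three-way distinction described above, the sketch below keeps the opaque local name, the controlled value string, and the per-language labels in separate fields. The namespace IRI, the local names `m001`/`m002`, and the Spanish label are all hypothetical placeholders, not values from the draft vocabulary or from the TDWG servers.

```python
from dataclasses import dataclass, field

# Hypothetical namespace IRI for the draft vocabulary (not confirmed in this thread).
NAMESPACE = "http://example.org/dwcmatter/values/"


@dataclass
class Concept:
    local_name: str          # opaque; only used to build the IRI
    controlled_value: str    # the single string everyone puts in a table cell
    labels: dict = field(default_factory=dict)  # language tag -> display label
    broader: tuple = ()      # local names of broader concepts (skos:broader)

    @property
    def iri(self):
        return NAMESPACE + self.local_name


# Illustrative entries only; local names and the "es" label are invented.
CONCEPTS = [
    Concept("m001", "PreservedSpecimen", {"en": "Preserved Specimen"}),
    Concept(
        "m002",
        "FossilSpecimen",
        {"en": "Fossil Specimen", "es": "Espécimen fósil"},
        broader=("m001",),
    ),
]

BY_VALUE = {c.controlled_value: c for c in CONCEPTS}
BY_LOCAL_NAME = {c.local_name: c for c in CONCEPTS}


def label_for(controlled_value, lang="en"):
    """Return the label in `lang`, falling back to English if none exists."""
    concept = BY_VALUE[controlled_value]
    return concept.labels.get(lang, concept.labels["en"])


def broader_values(controlled_value):
    """Resolve skos:broader links to controlled value strings."""
    concept = BY_VALUE[controlled_value]
    return [BY_LOCAL_NAME[name].controlled_value for name in concept.broader]
```

In this arrangement a spreadsheet always carries `FossilSpecimen`, a user interface can show whichever label matches the user's language, and the IRI is only needed where machine-readable relationships are wanted.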

@jbstatgen

Coming from GRSciColl and working on describing "Institutions" and "Collections", I added a couple of terms to the end of the list, as well as an additional sheet with the two existing vocabularies for the fields/properties describing "Collection": "Content types" and "Preparation types".

Neither input field works well: a CSV download of the information stored in GRSciColl shows that both fields are generally empty, or that users add information that doesn't make much sense when compared with the rest of the entered information. Obviously they need to be redesigned. Nevertheless, they can provide an idea and perspective about the dimensions associated with describing MaterialSampleType and about granularity.

For further background, since there is a bit of overlap too, this is my proposal for how to describe "Institution": GRSciColl_Vocabs. Comments are very much welcome, though since this is out of scope here, please send them to me directly.

@jbstatgen

Refer to existing draft controlled vocabulary for organism parts here and organized by organism group here. The terms are intended to be used as values for ac:subjectPart, which indicates the part of the organism being photographed, but it could generally refer to organism parts in other contexts.

@baskaufs ... no fungi ... (e.g. thallus, fruiting body, vegetative reproductive structure, mycelium, symbiont)

@baskaufs
Author

@jbstatgen

no fungi ... (e.g. thallus, fruiting body, vegetative reproductive structure, mycelium, symbiont)

We begged people to participate in this task group and no fungi experts joined. So we only have values for organism groups where someone suggested them.

The controlled vocabulary is intended to be extensible, so we'd be happy to add fungi if someone will suggest the terms, test with images, etc.

@baskaufs
Author

@dr-shorthair

I'd suggest being more clear about which strings are keys, in what context; and which strings are being stored as 'annotations' related to some prior context.

I don't understand what you are saying. Please refer to the governing specification, Sections 3.3.3.1 ("Controlled value") and 4.5.4 and offer suggestions on how they need to be clarified.

The approach taken there was a compromise between how concept metadata are described in "pure" SKOS thesauri and the actual practice within TDWG of simply using a certain plain text string as a value from a "controlled vocabulary".

@dr-shorthair

dr-shorthair commented Apr 25, 2022

Apologies - my comment was intended to be in the context of IDs. I'll try to find the thread I thought I was responding to. We can delete these bits of this conversation so that this issue does not have a confusing sub-thread.

@jbstatgen

...
no fungi ... (e.g. thallus, fruiting body, vegetative reproductive structure, mycelium, symbiont)
...
The controlled vocabulary is intended to be extensible, so we'd be happy to add fungi if someone will suggest the terms, test with images, etc.

@baskaufs What would it take to add the above terms to your vocabulary?

A) If it is a matter of the amount of information present in this overview and the first two links in your initial post, I could provide this for the above terms and learn along the way about how to construct and publish vocabularies correctly.

B) Though, there wouldn't be any testing and community agreement supporting the contributed terms. For that, the vocabularies need the mycologists and lichenologists, e.g. from the citizen science initiatives for fungi.

C) This is the Task Group you were mentioning. Your report for 2021 suggests that you are wrapping up and might not want to reopen the process.

Not sure where the balance in all of this is right now.

@Jegelewicz
Collaborator

@dr-shorthair no worries - just copy and repost wherever you want to comment!

@smrgeoinfo

I spent some time studying the draft controlled vocabulary (tabular form), and have some thoughts....
First, as a geologist and engineer, I don't know what a lot of the terms mean and didn't have time to look them all up, so this analysis is based on terms I think I understand.

  • This vocabulary starts down a VERY slippery slope to a vocabulary with all terms for all anatomical parts for all living or once-living things. This cannot scale -- a 'controlled vocabulary' with thousands of terms is hard (impossible?) to use. Usage of really granular terms needs to be scoped to the community that uses them. The controlled vocabulary (IMHO) is intended to help users filter out stuff to focus on what they're interested in, not select a facet term for the specific bone they're interested in.
  • A number of the terms appear to me to be adjectives (e.g. angular, articular, basibranchial, basioccipital, exoccipital, frontal). These are not appropriate as 'SampleTypes'.
  • There are a couple of things that aren't material samples, e.g. 'observation', 'model'.
  • 314 of 457 Terms have been mapped into the iSamples high level vocabulary for SpecimenType, mostly Organism Part (126) and Organism Product (41).

Perhaps a next step here is looking for some more general categories to lump categories into a vocabulary with a manageable number of classes, say on the order of 100 or so. And make them hierarchical.
Maybe something like Organism > plant organism > plant organism part along one branch.

Factoring specimen type along the lines of, say ... object type, material type, sampled feature, taxonomic class, anatomic class ... would allow defining a smaller set of categories and then let users build detailed vocabularies that map into combinations of those high-level categories.
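
One way to picture this kind of factoring is a record with one field per facet, so that filtering works on combinations of a few small vocabularies rather than on one long list. The sketch below is only an illustration: the class name, the field names, and the sample values are assumptions, not part of any proposed vocabulary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SampleDescription:
    """Hypothetical faceted description; each field would draw on its own small vocabulary."""
    object_type: str
    material_type: Optional[str] = None
    sampled_feature: Optional[str] = None
    taxonomic_class: Optional[str] = None
    anatomic_class: Optional[str] = None


def matches(sample, **criteria):
    """True if every given facet criterion equals the sample's value for that facet."""
    return all(getattr(sample, facet) == wanted for facet, wanted in criteria.items())


# Illustrative values only.
samples = [
    SampleDescription("organism part", material_type="bone", anatomic_class="basibranchial"),
    SampleDescription("organism part", material_type="bone", anatomic_class="frontal"),
    SampleDescription("environmental sample", material_type="soil"),
]

# Filter on the coarse facets without needing a term for each specific bone.
bones = [s for s in samples if matches(s, object_type="organism part", material_type="bone")]
```

A general user can stop at the coarse facets, while a specialist community can still constrain `anatomic_class` with its own detailed list.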

@RogerBurkhalter

@smrgeoinfo many of the terms you cite as adjectives are indeed individual bones (angular, articular, basibranchial, basioccipital, exoccipital, frontal) and may be important, especially for vertebrate paleontology where a complete skeleton is not found or only isolated bones are known. I do agree the list is painfully long and yet, as is, it still does not include all of the possible terms. The list I use in my CMS is hierarchical and has a "modifier" to handle adjectives like anterior partial, left lateral partial, etc., because not everything is complete in the paleo realm.

@smrgeoinfo

So in a hierarchical vocabulary, one might have something like:
whole organism > vertebrate organism > vertebrate body part > vertebrate bone > endochondral bone > basibranchial bone > Gymnura micrura basibranchial medial plate. For a TDWG materialSampleType vocabulary, the question is what is the useful level of granularity in this hierarchy; more detailed categorization would then fall in some free text field, or use a local, more granular vocabulary specific to some sub-community.
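
The granularity question can be sketched as a cut through that path: everything down to a chosen depth is the controlled value, and anything narrower goes to free text or a community vocabulary. The function name and the choice of depth 4 below are arbitrary assumptions for illustration; the path is the one given above.

```python
# The example path from this comment, broadest term first.
PATH = [
    "whole organism",
    "vertebrate organism",
    "vertebrate body part",
    "vertebrate bone",
    "endochondral bone",
    "basibranchial bone",
    "Gymnura micrura basibranchial medial plate",
]


def split_at_depth(path, max_depth):
    """Split a non-empty broad-to-narrow path into (controlled value, finer detail).

    The controlled value is the term at `max_depth` (1-based), or the last
    term if the path is shorter. Narrower terms are returned as a list for a
    free-text field or a more granular local vocabulary.
    """
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    kept = path[:max_depth]
    return kept[-1], path[max_depth:]


value, detail = split_at_depth(PATH, 4)
```

With a depth of 4, the shared vocabulary would record "vertebrate bone" and the three narrower terms would be left to the sub-community.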

@tucotuco
Member

Rather than try to build the vocabulary for anatomical parts, I would recommend the use of a SKOS-ified version of UBERON, the construction of which could be scripted and updated at any time.

@RogerBurkhalter

@tucotuco UBERON works for the living, not so well for the fossil groups. It is a great start and I will explore further.

@albenson-usgs

Perhaps a next step here is looking for some more general categories to lump categories into a vocabulary with a manageable number of classes, say on the order of 100 or so.

I want to make clear that when I suggested this task that is what I had intended would happen. In the How Did It Die Task Group this is what we did to come up with the vocabulary for causeOfDeath, see here where we have a full slate of what's currently in some of the databases for cause of death and then the lumping categories of Natural - abiotic, Natural - biotic, Anthropogenic, Unknown. I would hope we could get to a lumped list of 10 or so personally :-) We are going to overwhelm data providers if we make the list too long.

@baskaufs
Author

baskaufs commented May 2, 2022

@jbstatgen I've started a new issue tdwg/ac#240 in the Audubon Core repository regarding fungal parts to avoid getting this one off the track. We can continue the discussion there.

@Jegelewicz
Collaborator

The categories from GRSciColl Collection ContentType seem broad and relevant. Could these terms also be used as materialSampleType?

That may seem repetitive, but any given collection probably includes more than one of the ContentType(s), so allowing the addition of this "tag" to every record would seem potentially useful. However, they still seem oddly specific in some cases. How about the broader categorical terms?

Archaeological
Biological
Human Derived
Earth Planetary
Paleontological
Record

Really, it seems like the broader terms belong with the collection description and the more detailed values with the individual records, but I could see it going either way...

@smrgeoinfo

This mapping includes the GRSciColl terms.

@jbstatgen

jbstatgen commented May 3, 2022

Really, it seems like the broader terms belong with the collection description and the more detailed values with the individual records, but I could see it going either way...

Wouldn't this be the perfect situation for an ontology, i.e. a hierarchical classification? In that way one could automatically generate the aggregate of a collection's contents at any level.
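
A small sketch of what such automatic aggregation could look like: given parent links between terms, record-level values can be rolled up to any level of the hierarchy. The parent links and the record list below are invented for illustration and are not an agreed hierarchy.

```python
from collections import Counter

# Hypothetical child -> parent links; the arrangement is illustrative only.
PARENT = {
    "material sample": None,
    "Biological": "material sample",
    "Geological": "material sample",
    "Mycology": "Biological",
    "Botany": "Biological",
}


def ancestor_at_level(term, level):
    """Return the ancestor of `term` at `level` (0 = root), or None if the term sits above that level."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    chain.reverse()  # root first
    return chain[level] if level < len(chain) else None


def summarize(record_types, level):
    """Count records per term at the requested level of the hierarchy."""
    counts = Counter()
    for term in record_types:
        bucket = ancestor_at_level(term, level)
        if bucket is not None:
            counts[bucket] += 1
    return dict(counts)


# Made-up record-level values for one collection.
records = ["Mycology", "Botany", "Botany", "Geological"]
level1 = summarize(records, 1)
level2 = summarize(records, 2)
```

Records tagged only at a coarse level simply drop out of finer summaries, which is one design choice among several; they could instead be counted under an "unspecified" bucket.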

Archaeological
Biological
Human Derived
Earth Planetary
Paleontological
Record

I like this high-level approach, though there are a couple of reasons why I would like to see the list of terms modified.

  1. In our field we are dealing mostly with things "Biological". Thus, basically any record could get a tag "Biological", which then isn't informative anymore. Should we go that high, the list would be, it seems
Geological
Biological
Anthropogenic

[Record (what does "Record" refer to? Is that a subclass of Anthropogenic?)]

In a hierarchical approach this could be Level 1, with Level 0 being "material sample" vs. "information artifact".

  2. Level 2 within "Biological" will be most informative for many of our use cases. Here I am suggesting
Virology
Microbiology _(Would one want to split Bacteriology from Microbiology? That is, Bacteria, Archaebacteria versus the rest of all those evolutionarily dispersed lineages of microorganisms?)_
Mycology
Zoology _(How important is an immediate split into invertebrates - vertebrates?)_
Botany
Paleontology _(human remains go into Anthropology, right?)_
Biomedical _(or any term referring to human biology - and yes, actually this is Zoology)_

  3. Level 2 within "Geology": a distinction between planetary vs. extraterrestrial seems to be of interest, though I'm not familiar with what the correct/widely used terms might be. For example
Planetary/Terrestrial/Earth
Extraterrestrial with WithinSolarSystem vs. ExtrasolarSystem

Alternatively, would it be "Geology" vs. "Astronomy"?

  4. Level 2 within "Anthropogenic": for me this is anything made by humans. Also, the distinction between archaeology and anthropology doesn't seem to be clear-cut. E.g., is the https://en.wikipedia.org/wiki/Ahrensburg_culture down the road just outside town "Anthropology" or "Archaeology"? It's "a bunch of rocks in a circle and a couple of arrow tips" (Archaeology or cultural anthropology?). I'm not sure how many human remains/bones were found (Anthropology?), if any - though that seems to be dependent mostly on chance. Terms for a vocabulary might include
Anthropology/Archaeology
Cultural Artifacts
Library/Literature

  5. "Record": Would users understand something like "Cultural Artifact" or rather "Information Artifact/digital object" under this term? If this refers to a digital object, then it should be removed here and moved as a subclass into "Information Artifact" - Digital Objects/DES records would go into "Information Artifacts", together with images, audio/video recordings, etc.?

@jbstatgen

This mapping includes the GRSciColl terms.

@smrgeoinfo Could you please change the share settings for the file? Currently I can't access it and might not be the only one.
Thanks a lot, Jutta

@smrgeoinfo

Jutta -- sorry! Permissions updated; anyone with the link should be able to comment.

@Jegelewicz
Collaborator

@smrgeoinfo can we just add this to the original file? I'd prefer to just have one.

@smrgeoinfo

smrgeoinfo commented May 3, 2022 via email

@albenson-usgs

I would like to add saline water, non-saline water?, soil, and sediment but I'm not sure where to add them to the document? They aren't necessarily database uses but I would see them as materialSampleTypes that eDNA collectors would want to use. Should I add them to both the database uses tab and the iSamples mapping tab?

@Jegelewicz
Collaborator

I don't think it matters that they aren't currently in use - just add them to the database uses tab.

@Jegelewicz
Collaborator

Level 2 within "Biological" will be most informative for many of our use cases. Here I am suggesting

Virology
Microbiology _(Would one want to split Bacteriology from Microbiology? That is, Bacteria, Archaebacteria versus the rest of all those evolutionarily dispersed lineages of microorganisms?)_
Mycology
Zoology _(How important is an immediate split into invertebrates - vertebrates?)_
Botany
Paleontology _(human remains go into Anthropology, right?)_
Biomedical _(or any term referring to human biology - and yes, actually this is Zoology)_

But aren't these things really part of identification (with the exception of "Paleontology")? Would we be duplicating whatever is held in dwc:higherClassification?

A list (concatenated and separated) of taxa names terminating at the rank immediately superior to the taxon referenced in the taxon record.

While the terms in the list will not be found exactly in dwc:higherClassification, they can be inferred from there. Or are we to assume that any given dwc:MaterialSample may not have an associated dwc:Identification? If they do, how would this list be more informative than dwc:Identification plus dwc:higherClassification?
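
To show what "inferred from there" could mean in practice, here is a minimal sketch that splits a concatenated dwc:higherClassification value and looks for a recognizable kingdom. The `|` separator is an assumption about how a given dataset concatenates the list, and the kingdom-to-category mapping is hypothetical.

```python
# Hypothetical mapping from kingdom name to a discipline-style category.
DISCIPLINE_BY_KINGDOM = {
    "Animalia": "Zoology",
    "Plantae": "Botany",
    "Fungi": "Mycology",
}


def parse_higher_classification(value, separator="|"):
    """Split a concatenated higherClassification string into a list of taxon names."""
    return [part.strip() for part in value.split(separator) if part.strip()]


def infer_discipline(higher_classification):
    """Return the category for the first recognized kingdom in the list, or None."""
    for name in parse_higher_classification(higher_classification):
        if name in DISCIPLINE_BY_KINGDOM:
            return DISCIPLINE_BY_KINGDOM[name]
    return None
```

The `None` case matters for the question above: a material sample with no identification, or with a classification that starts below kingdom, gives nothing to infer from.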

@Jegelewicz
Collaborator

Some other vocabs to consider
ggbn:materialSampleType - https://rs.gbif.org/extension/ggbn/materialsample.xml

  • Classification of kind of physical sample in addition to BasisOfRecord/RecordBasis and Preparation Type. Please use preparationType for further specification such as "leg","blood","gDNA","axenic culture". Equal to KindOfUnit in ABCD! See also http://terms.tdwg.org/wiki/ggbn:materialSampleType (504 Gateway Time-out)
  • Examples: tissue, culture strain, specimen, DNA, RNA, Protein.

dwc:preparations - https://dwc.tdwg.org/terms/#dwc:preparations

  • A list (concatenated and separated) of preparations and preservation methods for a specimen.

ABCD KindOfUnit - https://terms.tdwg.org/wiki/abcd2:KindOfUnit (504 Gateway Time-out)

  • ?

@Jegelewicz
Collaborator

Closing as discussion has now moved to #26 #27 and #28


10 participants