Type Specimen CASTYPE1652 found via filtered query https://doi.org/10.15468/dl.xf6ahb, but not in open-access GBIF data product https://doi.org/10.15468/dl.pk3trq  #5

@jhpoelen

as posted in

https://discourse.gbif.org/t/type-specimen-castype1652-found-in-via-filtered-query-https-doi-org-10-15468-dl-xf6ahb-but-not-in-open-access-gbif-data-product-https-doi-org-1

on 2023-03-24


Hi!

First, thanks for providing this open discussion forum in addition to maintaining GBIF's expansive biodiversity data universe.

Second, apologies in advance for the long and rather detailed post below.

The executive summary is that I am trying to figure out why I can find type specimen CASTYPE1652 in filtered query https://doi.org/10.15468/dl.xf6ahb, but not in open-access GBIF data product https://doi.org/10.15468/dl.pk3trq .

The text below describes how I got to the datasets, and ends with specific questions.

As I am tracking (versioned) digital traces associated with type specimen CASTYPE1652 (see https://beehind.org), I downloaded the open access data product (all :

GBIF.org (01 March 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.pk3trq

via https://api.gbif.org/v1/occurrence/download/request/0015281-230224095556074.zip
to produce ~260G of digital content with id hash://sha256/c8bac8acb28c8524c53589b3a40e322dbbbdadf5689fef2e20266fbf6ddf6b97 .
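A minimal sketch of how a content id in the hash://sha256/ notation could be computed for a downloaded file, assuming the id is the SHA-256 of the raw file bytes. The function name `content_id` and the chunk size are illustrative choices, not necessarily the tooling used for this post.

```python
import hashlib


def content_id(path, chunk_size=1024 * 1024):
    """Return a hash://sha256/... identifier for the bytes stored at `path`.

    Assumption: the identifier is the SHA-256 digest of the raw file bytes.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a very large file never has to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return "hash://sha256/" + digest.hexdigest()
```

Running `content_id` on the downloaded zip file and comparing the result with the id above would confirm that two copies of the download are byte-for-byte identical.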

Then, I used a streaming query to count all lines in the "simple" table that was included in the file. In addition, I attempted to filter the data to include only records with collectionCode CASTYPE , the collection code of the collection that keeps the type specimen with catalog number CASTYPE1652 .

After 5h15m of processing at a rate of about 100k lines/s , I counted 2.07 billion lines. Also, I found no records with collectionCode CASTYPE.
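The counting and filtering step can be sketched roughly as below. This is an illustration under stated assumptions, not the exact query used: it assumes the "simple" table is tab-separated text with a header row that contains a `collectionCode` column, and the helper names `open_zip_member` and `count_and_filter` are hypothetical.

```python
import io
import zipfile


def open_zip_member(zip_path, member):
    """Open one member of a zip file as a text stream, without unpacking it."""
    archive = zipfile.ZipFile(zip_path)
    # newline="\n" splits lines only on "\n", so stray "\r" inside a field is kept.
    return io.TextIOWrapper(archive.open(member), encoding="utf-8", newline="\n")


def count_and_filter(lines, collection_code="CASTYPE"):
    """Count data rows and collect the rows whose collectionCode matches.

    `lines` is any iterable of text lines; the first line must be the header.
    """
    lines = iter(lines)
    header = next(lines).rstrip("\r\n").split("\t")
    col = header.index("collectionCode")  # raises ValueError if the column is missing
    total = 0
    matches = []
    for line in lines:
        total += 1
        fields = line.rstrip("\r\n").split("\t")
        # Skip the comparison for short or malformed rows.
        if len(fields) > col and fields[col] == collection_code:
            matches.append(fields)
    return total, matches
```

Because the rows are read one at a time, memory use depends on the number of matching rows rather than on the size of the table. Comparing the exact column value, instead of searching for the text CASTYPE anywhere in a line, avoids false hits from other fields such as catalog numbers.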

To confirm that the collectionCode CASTYPE was actually used in associated records, and existed on and prior to 1 March 2023, I verified that the 1 March 2023 (https://linker.bio/zip:hash://sha256/ffffe616beab7b4a04e46162cdbd2584f986e3f5f5b56258f9737ee31f36b6b6!/occurrence.txt) and 1 January 2023 (https://linker.bio/zip:hash://sha256/110f398aa4c8a4be870c7b3c1d698c32eb2c8dad878b614fe8e8f7a153251a43!/occurrence.txt) versions of the DarwinCore archive provided by the California Academy of Sciences via http://ipt.calacademy.org:8080/archive.do?r=type included records with collection code CASTYPE.

Also, I logged in to the GBIF web portal and created a "download" with citation:

GBIF.org (24 March 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.xf6ahb

This download included a filter to only include records associated with GBIF dataset https://www.gbif.org/dataset/6ec3c7f5-6233-48f6-b36a-06b867edbadd associated with the CASTYPE collection.

Using the same methods as earlier, I selected records including mention of collectionCode CASTYPE . Contrary to the earlier results, records with CASTYPE collectionCode now appeared, including CASTYPE1652.

So, given the contradictory results, I was wondering:

  1. Can anybody confirm that CASTYPE records (including CASTYPE1652) do not appear in GBIF.org (01 March 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.pk3trq ?
  2. Can someone explain why the GBIF front page claims to have over 2.2 billion records indexed, whereas GBIF.org (01 March 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.pk3trq appears to include about 200M fewer records?

Most likely, I don't fully understand what to expect to be included in GBIF.org (01 March 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.pk3trq , so I very much appreciate your insights to better understand these valuable datasets.

Again, apologies for the long and detailed post, and I am curious to hear anyone's thoughts on how I should proceed.

thx,
-jorrit

https://jhpoelen.nl

PS. The overarching use case is to document associations between GBIF occurrence identifiers and their associated institution code, collection code, and catalog number. I need this to establish links between CASTYPE1652 (or other specimens) and their digital traces in GBIF , and, indirectly, to Bionomia. Because Bionomia uses GBIF identifiers to link people to their associated records, I need to "speak" GBIF identifiers to resolve the wealth of knowledge of the people behind collections as facilitated/enriched by @dshorthouse https://bionomia.net . fyi @Debbie @seltmann
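As a rough sketch of that use case, the associations could be collected into a lookup table keyed by institution code, collection code, and catalog number. This assumes a tab-separated table whose header includes the columns `gbifID`, `institutionCode`, `collectionCode`, and `catalogNumber`; the function name `index_by_triple` is hypothetical.

```python
from collections import defaultdict

KEY_COLUMNS = ("institutionCode", "collectionCode", "catalogNumber")


def index_by_triple(lines):
    """Map (institutionCode, collectionCode, catalogNumber) to a list of gbifIDs.

    `lines` is any iterable of tab-separated text lines, header first.
    """
    lines = iter(lines)
    header = next(lines).rstrip("\r\n").split("\t")
    id_col = header.index("gbifID")
    key_cols = [header.index(name) for name in KEY_COLUMNS]
    needed = max(key_cols + [id_col])
    index = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) <= needed:
            continue  # skip short or malformed rows
        triple = tuple(fields[c] for c in key_cols)
        # A list is kept per triple because the same triple may occur more than once.
        index[triple].append(fields[id_col])
    return dict(index)
```

Looking up the triple that ends in the catalog number CASTYPE1652 would then return the GBIF identifier(s) needed to follow the specimen into Bionomia. For a table with billions of rows, the input would likely need to be filtered first (for example by collectionCode) so that the index fits in memory.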
