PMC track

This track contains data retrieved from PubMed Central (PMC).

Note that we have filtered the data set so only a subset of the articles is included.

Example Document

Note that the body content is actually much longer and has been shortened here to improve readability.

{
  "name": "3_Biotech_2015_Dec_13_5(6)_1007-1019",
  "journal": "3 Biotech",
  "date": "2015 Dec 13",
  "volume": "5(6)",
  "issue": "1007-1019",
  "accession": "PMC4624133",
  "timestamp": "2015-10-30 20:08:11",
  "pmid": "",
  "body": "\n==== Front\n3 Biotech3 Biotech3 Biotech2190-572X2190-5738Springer ..."
}
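
As a minimal sketch of how a document of this shape can be inspected, the following Python snippet parses one JSON document and summarizes a few of its fields. The helper name `summarize_document` is illustrative and not part of the track, and the snippet assumes that each document is available as a single JSON string.

```python
import json

# Shortened copy of the example document shown above.
example = {
    "name": "3_Biotech_2015_Dec_13_5(6)_1007-1019",
    "journal": "3 Biotech",
    "date": "2015 Dec 13",
    "volume": "5(6)",
    "issue": "1007-1019",
    "accession": "PMC4624133",
    "timestamp": "2015-10-30 20:08:11",
    "pmid": "",
    "body": "\n==== Front\n3 Biotech3 Biotech3 Biotech2190-572X2190-5738Springer ...",
}


def summarize_document(line):
    """Parse one JSON document and return a few fields plus the body length."""
    doc = json.loads(line)
    return {
        "accession": doc["accession"],
        "journal": doc["journal"],
        "has_pmid": bool(doc["pmid"]),  # pmid may be an empty string
        "body_chars": len(doc["body"]),
    }


# Serialize the dict to mimic a raw JSON string, then summarize it.
summary = summarize_document(json.dumps(example))
print(summary)
```

In the example document `pmid` is empty, so code that consumes the corpus should probably not assume that every field carries a value.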

Parameters

This track allows you to override the following parameters with Rally 0.8.0+ using --track-params:

bulk_size (default: 500)
bulk_indexing_clients (default: 8): Number of clients that issue bulk indexing requests.
ingest_percentage (default: 100): A number between 0 and 100 that defines how much of the document corpus should be ingested.
conflicts (default: "random"): Type of id conflicts to simulate. Valid values are 'sequential' (a document id is replaced with a sequentially increasing id) and 'random' (a document id is replaced with a random other id).
conflict_probability (default: 25): A number between 0 and 100 that defines the probability of id conflicts. This requires running the respective challenge. Combining conflicts=sequential and conflict_probability=0 makes Rally generate index ids by itself, instead of relying on Elasticsearch's automatic id generation (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#_automatic_id_generation).
on_conflict (default: "index"): Whether to use an "index" or an "update" action when simulating an id conflict.
recency (default: 0): A number between 0 and 1 that defines whether to bias towards more recent ids when simulating conflicts. See the Rally docs for the full definition of this parameter. This requires running the respective challenge.
max_num_segments (default: -1): The number of segments to target when doing a force merge.
number_of_replicas (default: 0)
number_of_shards (default: 5)
source_enabled (default: true): A boolean defining whether the _source field is stored in the index.
index_settings: A list of index settings. Index settings defined elsewhere (e.g. number_of_replicas) need to be overridden explicitly.
default_search_timeout (default: -1)
cluster_health (default: "green"): The minimum required cluster health.
error_level (default: "non-fatal"): Sets ignore-response-error-level. Available for bulk operations only.
post_ingest_sleep (default: false): Whether to pause after ingest and prior to subsequent operations.
post_ingest_sleep_duration (default: 30): Sleep duration in seconds.
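
The parameters above can be combined into a single --track-params value. The following Python sketch checks a few overrides against the ranges documented in the list and joins them into a string, assuming the comma-separated key:value form of --track-params. The helper `build_track_params` is hypothetical and not part of Rally or this track. Structured values such as index_settings do not fit this form and are left out here.

```python
def build_track_params(overrides):
    """Validate overrides against the documented ranges and return a
    comma-separated key:value string for --track-params."""
    # Parameters documented as "a number between 0 and 100".
    for key in ("ingest_percentage", "conflict_probability"):
        if key in overrides and not 0 <= overrides[key] <= 100:
            raise ValueError(f"{key} must be between 0 and 100")
    # recency is documented as "a number between 0 and 1".
    if "recency" in overrides and not 0 <= overrides["recency"] <= 1:
        raise ValueError("recency must be between 0 and 1")
    # Parameters with a fixed set of documented values.
    allowed = {
        "conflicts": ("sequential", "random"),
        "on_conflict": ("index", "update"),
    }
    for key, values in allowed.items():
        if key in overrides and overrides[key] not in values:
            raise ValueError(f"{key} must be one of {values}")

    parts = []
    for key, value in overrides.items():
        if isinstance(value, bool):
            # Assumption: booleans are written in lowercase (true/false).
            value = str(value).lower()
        parts.append(f"{key}:{value}")
    return ",".join(parts)


params = build_track_params(
    {"bulk_size": 1000, "ingest_percentage": 50, "source_enabled": False}
)
print(f'--track-params="{params}"')
```

The resulting argument would be passed to Rally together with the track name, for example --track=pmc. The exact command line depends on the Rally version in use.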

License

All included articles are licensed under CC-BY (http://creativecommons.org/licenses/by/2.0/).

This data set is licensed under the same terms. Please refer to http://creativecommons.org/licenses/by/2.0/ for details.

Attribution hint:

You can download a full list of the author information for each included document from http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/pmc/attribution.txt.bz2 (size: 52.2MB)
