Available Extractors

These pages detail all of the extractors currently available in Scythe.

Quick Summary

The extractors that are configured to work with the stevedore plugin are:

  • ase – Parse information from atomistic simulation input files using ASE.

  • crystal – Extract information about a crystal structure from many types of files.

  • csv – Describe the contents of a comma-separated value (CSV) file

  • dft – Extract metadata from Density Functional Theory calculation results

  • em – Extract metadata specific to electron microscopy.

  • filename – Extracts metadata in a filename, according to user-supplied patterns.

  • generic – Gather basic file information

  • image – Retrieves basic information about an image

  • json – Extracts fields in JSON into a user-defined new schema.

  • noop – Determine whether files exist; used for debugging.

  • tdb – Extract metadata from a Thermodynamic Database (TDB) file.

  • xml – Extracts fields in XML into a user-defined new schema in JSON.

  • yaml – Extracts fields in YAML into a user-defined new schema in JSON.
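stevedore discovers plugins through Python entry points, so the names above can be listed with the standard library alone. The following is a minimal sketch, not Scythe's own loading code. The entry-point group name "scythe.extractor" is an assumption that this page does not state; substitute the group your installation uses.

```python
from importlib.metadata import entry_points

# Assumption: this group name is a guess and is not stated on this page.
ASSUMED_GROUP = "scythe.extractor"


def list_plugin_names(group=ASSUMED_GROUP):
    """Return the sorted names of entry points registered under `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns an object with a select() method.
        selected = eps.select(group=group)
    else:
        # Older versions return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


if __name__ == "__main__":
    # Prints an empty list when nothing is registered under the group.
    print(list_plugin_names())
```

If Scythe is installed and the assumed group name is right, the printed names should correspond to the short names in the list above.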

Detailed Listing

Generic File Extractors

Extractors that work for any kind of file

class scythe.file.GenericFileExtractor(store_path=True, compute_hash=True)

Gather basic file information

Parameters:
  • store_path (bool) – Whether to record the path of the file

  • compute_hash (bool) – Whether to compute the hash of a file
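To illustrate what the two flags might control, here is a standard-library sketch of gathering basic file information. It is not the GenericFileExtractor implementation: the helper name basic_file_info, the record keys, and the choice of SHA-512 are all assumptions made for this example.

```python
import hashlib
import os


def basic_file_info(path, store_path=True, compute_hash=True):
    """Collect simple facts about one file (hypothetical helper)."""
    # Key names here are illustrative, not Scythe's output schema.
    record = {"length": os.path.getsize(path)}
    if store_path:
        record["path"] = path
    if compute_hash:
        digest = hashlib.sha512()  # hash algorithm is an assumption
        with open(path, "rb") as fp:
            # Read in chunks so large files are not loaded into memory.
            for chunk in iter(lambda: fp.read(65536), b""):
                digest.update(chunk)
        record["sha512"] = digest.hexdigest()
    return record
```

Turning off compute_hash avoids reading the whole file, which can matter for large data files.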

Image Extractors

Extractors that read image data

class scythe.image.ImageExtractor

Retrieves basic information about an image

Electron Microscopy Extractors

Extractors that read electron microscopy data of various sorts (images, spectra, spectrum images, etc.) using the HyperSpy package.

class scythe.electron_microscopy.ElectronMicroscopyExtractor

Extract metadata specific to electron microscopy.

This parser handles any file supported by HyperSpy’s I/O capabilities. It extracts both the metadata interpreted directly by HyperSpy and any important values that can be picked out manually.

For each known value, the parser returns a subdict with two keys: value, containing the actual value of the metadata parameter, and unit, a string containing a unit name from the QUDT vocabulary. Including a unit is optional but highly recommended whenever the unit is known.
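The value/unit layout described above can be sketched as follows. The helper names and the field names (acceleration_voltage, magnification, probe_size) are hypothetical, and the unit strings are intended as QUDT-style names that should be checked against the vocabulary and the schema file before use.

```python
def quantity(value, unit=None):
    """Build a {'value': ..., 'unit': ...} subdict; omit the unit if unknown."""
    entry = {"value": value}
    if unit is not None:
        entry["unit"] = unit
    return entry


def drop_unknown(metadata):
    """Remove entries whose value is not known (None)."""
    return {k: v for k, v in metadata.items() if v["value"] is not None}


# Field names are hypothetical; unit strings are meant as QUDT unit names.
metadata = drop_unknown({
    "acceleration_voltage": quantity(200.0, "KiloV"),
    "magnification": quantity(50000),         # dimensionless, no unit given
    "probe_size": quantity(None, "NanoM"),    # unknown value, so it is dropped
})
```

Keeping the unit next to each value means a consumer does not have to guess whether, say, a voltage was recorded in volts or kilovolts.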

The allowed metadata values are controlled by the JSONSchema specification in the schemas/electron_microscopy.json file.

Atomistic Data Extractors

Extractors related to data files that encode atom-level structure

class scythe.crystal_structure.CrystalStructureExtractor

Extract information about a crystal structure from many types of files.

Uses either ASE or Pymatgen on the back end

class scythe.ase.ASEExtractor

Parse information from atomistic simulation input files using ASE.

ASE can read many file types. These can be found at https://wiki.fysik.dtu.dk/ase/ase/io/io.html

Metadata are generated in the ASE JSON DB format: https://wiki.fysik.dtu.dk/ase/ase/db/db.html

scythe.ase.object_hook(dct)

Custom decoder for ASE JSON objects

Does everything except reconstitute the JSON object, and additionally converts numpy arrays to lists

Adapted from ase.io.jsonio

Parameters:

dct (dict) – Dictionary to reconstitute to an ASE object

Calculation Extractors

Extractors that retrieve results from calculations

class scythe.dft.DFTExtractor(quality_report=False)

Extract metadata from Density Functional Theory calculation results

Uses the dfttopif parser to extract metadata from each file

Initialize the extractor

Parameters:

quality_report (bool) – Whether to generate a quality report

extract(group: Iterable[str], context: dict | None = None)

Extract metadata from a group of files

A group of files is a set of 1 or more files that describe the same object and will be used together to create a single metadata record.

Parameters:
  • group ([str]) – A list of one or more files that should be parsed together

  • context (dict) – Context about the files

Returns:

The parsed results, in JSON-serializable format.

Return type:

(dict)
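The calling convention above (a group of files in, one JSON-serializable record out) can be sketched with a toy class. ToyGroupExtractor is a hypothetical stand-in rather than a Scythe class. It mirrors only the extract signature shown above, and the record keys and the "label" context field are invented for this example.

```python
import json
import os
from typing import Iterable, Optional


class ToyGroupExtractor:
    """Hypothetical stand-in that mimics the extract(group, context) shape."""

    def extract(self, group: Iterable[str], context: Optional[dict] = None) -> dict:
        context = context or {}
        files = sorted(group)
        # One record describes the whole group, not each file separately.
        return {
            "files": files,
            "total_bytes": sum(os.path.getsize(f) for f in files),
            "label": context.get("label"),  # invented context field
        }


def to_json(record: dict) -> str:
    """Serialize a record; raises TypeError if it is not JSON-serializable."""
    return json.dumps(record, sort_keys=True)
```

A real call to DFTExtractor would pass the files of a single calculation as the group; which files belong together depends on the DFT code that produced them.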

Structured Data Files

Extractors that read data from structured files

class scythe.csv.CSVExtractor(return_records=True, **kwargs)

Describe the contents of a comma-separated value (CSV) file

The context dictionary for the CSV parser includes several fields:
  • schema: Dictionary defining the schema for this dataset, following that of FrictionlessIO

  • na_values: Any values that should be interpreted as missing

Parameters:

return_records (bool) – Whether to return each row in the CSV file

Keyword:

All kwargs as passed to TableSchema’s infer method
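To make the two context fields concrete, the sketch below builds a context dictionary and applies it with the standard csv module. It does not use CSVExtractor or the Table Schema library. The schema follows the general Frictionless Table Schema layout (a "fields" list of name/type pairs), while the read_rows helper, the column names, and the small type map are assumptions made for this example.

```python
import csv
import io

# Example context; column names and values are made up.
context = {
    "schema": {
        "fields": [
            {"name": "sample", "type": "string"},
            {"name": "hardness", "type": "number"},
        ]
    },
    "na_values": ["N/A", ""],
}

# Minimal mapping from schema type names to Python casts (illustrative only).
CASTS = {"string": str, "integer": int, "number": float}


def read_rows(text, context):
    """Parse CSV text, casting cells by schema and mapping NA values to None."""
    types = {f["name"]: f["type"] for f in context["schema"]["fields"]}
    missing = set(context.get("na_values", []))
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {}
        for name, cell in raw.items():
            if cell in missing:
                row[name] = None
            else:
                row[name] = CASTS[types.get(name, "string")](cell)
        rows.append(row)
    return rows


rows = read_rows("sample,hardness\nA,12.5\nB,N/A\n", context)
```

Declaring na_values up front keeps placeholder strings such as "N/A" from forcing a numeric column to be treated as text.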

citations() → List[str]

Citation(s) and reference(s) for this extractor

Returns:

A list in which each element is a citation string in BibTeX format

Return type:

([str])