Available Extractors

These pages detail all of the extractors currently available in Scythe.

Quick Summary

The extractors that are configured to work with the stevedore plugin are:

ase – Parse information from atomistic simulation input files using ASE.
crystal – Extract information about a crystal structure from many types of files.
csv – Describe the contents of a comma-separated value (CSV) file.
dft – Extract metadata from Density Functional Theory calculation results.
em – Extract metadata specific to electron microscopy.
filename – Extracts metadata in a filename, according to user-supplied patterns.
generic – Gather basic file information.
image – Retrieves basic information about an image.
json – Extracts fields in JSON into a user-defined new schema.
noop – Determine whether files exist, used for debugging.
tdb – Extract metadata from a Thermodynamic Database (TDB) file.
xml – Extracts fields in XML into a user-defined new schema in JSON.
yaml – Extracts fields in YAML into a user-defined new schema in JSON.

Detailed Listing

Generic File Extractors

Extractors that work for any kind of file

class scythe.file.GenericFileExtractor(store_path=True, compute_hash=True)

Gather basic file information

Parameters:

store_path (bool) – Whether to record the path of the file
compute_hash (bool) – Whether to compute the hash of a file
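
The two flags can be pictured with a minimal standard-library sketch. This is not Scythe's implementation: the helper name `basic_file_info`, the returned keys, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import os


def basic_file_info(path, store_path=True, compute_hash=True):
    """Collect basic facts about one file (hypothetical helper, not Scythe code)."""
    info = {"length": os.path.getsize(path)}  # file size in bytes
    if store_path:
        info["path"] = path
    if compute_hash:
        digest = hashlib.sha256()  # assumed algorithm; Scythe may use another
        with open(path, "rb") as handle:
            # Read in chunks so large files are never held fully in memory
            for chunk in iter(lambda: handle.read(65536), b""):
                digest.update(chunk)
        info["sha256"] = digest.hexdigest()
    return info
```

Turning off `compute_hash` skips reading the file contents entirely, which may matter for very large files.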

Image Extractors

Extractors that read image data

class scythe.image.ImageExtractor

Retrieves basic information about an image

Electron Microscopy Extractors

Extractors that read electron microscopy data of various sorts (images, spectra, spectrum images, etc.) using the HyperSpy package.

class scythe.electron_microscopy.ElectronMicroscopyExtractor

Extract metadata specific to electron microscopy.

This parser handles any file supported by HyperSpy’s I/O capabilities. It extracts both the metadata interpreted directly by HyperSpy and any important values that can be picked out manually.

For each value that is known, return a sub-dictionary with two keys: value, containing the actual value of the metadata parameter, and unit, a string containing a unit name from the QUDT vocabulary. Including a unit is optional but highly recommended when it is known.
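
The value/unit layout described above might look like the following sketch. The helper `with_unit` and the field names `acceleration_voltage` and `magnification` are hypothetical and not taken from the Scythe schema; `KiloV` is assumed to be the QUDT name for kilovolt.

```python
def with_unit(value, unit=None):
    """Wrap a metadata value in the value/unit sub-dictionary layout."""
    entry = {"value": value}
    if unit is not None:
        # The unit key is optional, so it is only added when a unit is known
        entry["unit"] = unit
    return entry


# Hypothetical field names, chosen only to illustrate the layout
record = {
    "acceleration_voltage": with_unit(200, "KiloV"),
    "magnification": with_unit(50000),
}
```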

The allowed metadata values are controlled by the JSONSchema specification in the schemas/electron_microscopy.json file.

Atomistic Data Extractors

Extractors related to data files that encode atom-level structure

class scythe.crystal_structure.CrystalStructureExtractor

Extract information about a crystal structure from many types of files.

Uses either ASE or Pymatgen on the back end

class scythe.ase.ASEExtractor

Parse information from atomistic simulation input files using ASE.

ASE can read many file types, which are listed at https://wiki.fysik.dtu.dk/ase/ase/io/io.html

Metadata are generated in the ASE JSON DB format: https://wiki.fysik.dtu.dk/ase/ase/db/db.html

scythe.ase.object_hook(dct)

Custom decoder for ASE JSON objects

Does everything the ASE decoder does except reconstitute the object, and additionally converts numpy arrays to lists

Adapted from ase.io.jsonio

Parameters:

dct (dict) – Dictionary to reconstitute to an ASE object

Calculation Extractors

Extractors that retrieve results from calculations

class scythe.dft.DFTExtractor(quality_report=False)

Extract metadata from Density Functional Theory calculation results

Uses the dfttopif parser to extract metadata from each file

Initialize the extractor

Parameters:

quality_report (bool) – Whether to generate a quality report

extract(group: Iterable[str], context: dict | None = None)

Extract metadata from a group of files

A group of files is a set of one or more files that describe the same object and are used together to create a single metadata record.

Parameters:

group ([str]) – A list of one or more files that should be parsed together
context (dict) – Context about the files

Returns:

The parsed results, in JSON-serializable format.

Return type:

(dict)
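
The call shape of `extract` can be sketched with a toy class. `FileSizeExtractor`, its output keys, and the `label` context entry are hypothetical; only the `extract(group, context)` signature and the JSON-serializable dictionary result follow the description above.

```python
import os
from typing import Iterable, Optional


class FileSizeExtractor:
    """Hypothetical extractor that follows the extract(group, context) shape."""

    def extract(self, group: Iterable[str], context: Optional[dict] = None) -> dict:
        context = context or {}
        files = sorted(group)  # stable order regardless of input order
        # One record for the whole group, built only from JSON-friendly types
        return {
            "files": files,
            "total_bytes": sum(os.path.getsize(f) for f in files),
            "label": context.get("label"),
        }
```

All files in the group contribute to one record, rather than producing one record per file.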


Structured Data Files

Extractors that read data from structured files

class scythe.csv.CSVExtractor(return_records=True, **kwargs)

Describe the contents of a comma-separated value (CSV) file

The context dictionary for the CSV parser includes the following fields:

schema: Dictionary defining the schema for this dataset, following the FrictionlessIO format
na_values: Any values that should be interpreted as missing

Parameters:

return_records (bool) – Whether to return each row in the CSV file

Keyword Arguments:

All kwargs are passed to TableSchema’s infer method
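
To make the context fields concrete, the sketch below builds a context dictionary and applies it with the standard-library `csv` module. It does not use FrictionlessIO or TableSchema and is not the CSVExtractor implementation; the column names and the helper `read_rows` are hypothetical. The `fields` list of name/type entries follows the Table Schema layout.

```python
import csv
import io

# Hypothetical context: two columns and two markers for missing data
context = {
    "schema": {
        "fields": [
            {"name": "temperature", "type": "number"},
            {"name": "phase", "type": "string"},
        ]
    },
    "na_values": ["N/A", ""],
}


def read_rows(text, context):
    """Parse CSV text, mapping na_values to None and numbers to float."""
    na_values = set(context.get("na_values", []))
    types = {f["name"]: f["type"] for f in context["schema"]["fields"]}
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {}
        for name, cell in raw.items():
            if cell in na_values:
                row[name] = None  # treated as missing
            elif types.get(name) == "number":
                row[name] = float(cell)
            else:
                row[name] = cell
        rows.append(row)
    return rows


rows = read_rows("temperature,phase\n300,bcc\nN/A,fcc\n", context)
```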

citations() → List[str]

Citation(s) and reference(s) for this extractor

Returns:

Each element should be a string citation in BibTeX format.

Return type:

([str])