scythe

Documentation for the non-parser functions in scythe.

scythe.adapters.base

Base classes for adapters

class scythe.adapters.base.BaseAdapter

Template for tools that transform metadata into a new form

check_compatibility(parser: BaseExtractor) → bool

Evaluate whether an adapter is compatible with a certain parser

Parameters:

parser (BaseExtractor) – Parser to evaluate

Returns:

(bool) Whether this parser is compatible

abstract transform(metadata: dict, context: None | dict = None) → Any

Process metadata into a new form

Parameters:
  • metadata (dict) – Metadata to transform

  • context (dict) – Any context information used during transformation

Returns:

Metadata in a new form, can be any type of object. None corresponding

version() → None | str

Version of the parser that an adapter was created for

Returns:

(str) Version of parser this adapter was designed for, or None if not applicable
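As a rough illustration of the adapter interface described above, the sketch below defines a stand-in base class with the documented transform and version methods, plus a subclass that keeps only selected keys. The names AdapterSketch, KeyFilterAdapter, and the "keep" context key are assumptions made for this example and are not part of scythe; a real adapter would presumably subclass scythe.adapters.base.BaseAdapter instead.

```python
from abc import ABC, abstractmethod
from typing import Any, Optional


class AdapterSketch(ABC):
    """Stand-in mirroring the documented BaseAdapter methods (not the scythe class)."""

    @abstractmethod
    def transform(self, metadata: dict, context: Optional[dict] = None) -> Any:
        """Process metadata into a new form."""

    def version(self) -> Optional[str]:
        """Version of the parser the adapter targets, or None if not applicable."""
        return None


class KeyFilterAdapter(AdapterSketch):
    """Hypothetical adapter: keep only the keys listed under context['keep']."""

    def transform(self, metadata: dict, context: Optional[dict] = None) -> dict:
        keep = (context or {}).get("keep")
        if keep is None:
            return dict(metadata)  # no filter requested, return a shallow copy
        return {k: v for k, v in metadata.items() if k in keep}


adapter = KeyFilterAdapter()
filtered = adapter.transform({"a": 1, "b": 2}, context={"keep": ["a"]})
print(filtered)  # {'a': 1}
```

Because transform is abstract, the stand-in base class cannot be instantiated directly; only subclasses that implement it can.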

class scythe.adapters.base.GreedySerializeAdapter

Converts the metadata to a string by serializing with JSON, making some (hopefully) informed choices about what to do with various types commonly seen, and otherwise reporting that the data type could not be serialized. May not work in all situations, but should cover a large number of cases.

transform(metadata: dict, context=None) → str

Process metadata into a new form

Parameters:
  • metadata (dict) – Metadata to transform

  • context (dict) – Any context information used during transformation

Returns:

(str) Metadata serialized as a JSON string
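The "greedy" strategy described for this class can be approximated with the default hook of json.dumps, which is called for every object the encoder cannot handle natively. The sketch below is an assumption about how such a fallback might look, not the scythe implementation; the helper names greedy_default and greedy_serialize, the set of handled types, and the placeholder text are all invented for illustration.

```python
import json
from datetime import date
from pathlib import Path


def greedy_default(obj):
    """Fallback used by json.dumps for objects it cannot serialize natively."""
    if isinstance(obj, date):  # also covers datetime, a subclass of date
        return obj.isoformat()
    if isinstance(obj, (set, frozenset)):
        return sorted(obj, key=str)  # sorted for a deterministic order
    if isinstance(obj, bytes):
        return obj.decode("utf-8", errors="replace")
    if isinstance(obj, Path):
        return str(obj)
    # Report, rather than raise, when the type is unknown
    return f"<<Unserializable type: {type(obj).__name__}>>"


def greedy_serialize(metadata: dict) -> str:
    """Serialize metadata to JSON, falling back to greedy_default when needed."""
    return json.dumps(metadata, default=greedy_default)


record = {"when": date(2020, 1, 2), "tags": {"b", "a"}, "handle": object()}
print(greedy_serialize(record))
```

A plain json.dumps call without the default hook would raise TypeError on this record, which is the behavior a non-greedy serializer would show.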

class scythe.adapters.base.NOOPAdapter

Adapter that does not alter the output data

Used for testing purposes

transform(metadata: dict, context=None) → dict

Process metadata into a new form

Parameters:
  • metadata (dict) – Metadata to transform

  • context (dict) – Any context information used during transformation

Returns:

(dict) The metadata, unaltered

class scythe.adapters.base.SerializeAdapter

Converts the metadata to a string by serializing with JSON

transform(metadata: dict, context=None) → str

Process metadata into a new form

Parameters:
  • metadata (dict) – Metadata to transform

  • context (dict) – Any context information used during transformation

Returns:

(str) Metadata serialized as a JSON string

scythe.utils.interface

Utilities for working with extractors from other applications

class scythe.utils.interface.ExtractResult(group, extractor, metadata)

Create new instance of ExtractResult(group, extractor, metadata)

extractor

Alias for field number 1

group

Alias for field number 0

metadata

Alias for field number 2
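The field aliases above follow the usual named-tuple layout. As a small sketch, the stand-in below rebuilds a tuple with the same field order using collections.namedtuple rather than importing it from scythe; the file name, the extractor name "generic", and the metadata values are placeholders chosen for this example.

```python
from collections import namedtuple

# Stand-in with the documented field order: group (0), extractor (1), metadata (2)
ExtractResult = namedtuple("ExtractResult", ["group", "extractor", "metadata"])

result = ExtractResult(group=("a.txt",), extractor="generic", metadata={"size": 10})

# Fields can be read by name or by position
assert result.group == result[0]
assert result.extractor == result[1]
assert result.metadata == result[2]

# Tuple unpacking follows the same order
group, extractor, metadata = result
print(group, extractor, metadata)
```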

scythe.utils.interface.get_adapter(name: str) → BaseAdapter

Load an adapter

Parameters:

name (str) – Name of adapter

Returns:

(BaseAdapter) Requested adapter

scythe.utils.interface.get_available_adapters() → dict

Get information on all available adapters

Returns:

(dict) Where keys are adapter names and values are descriptions

scythe.utils.interface.get_available_extractors()

Get information about the available extractors

Returns:

Descriptions of available extractors

Return type:

[dict]

scythe.utils.interface.get_extractor(name: str) → BaseExtractor

Load an extractor object

Parameters:

name (str) – Name of extractor

Returns:

Requested extractor

scythe.utils.interface.get_extractor_and_adapter_contexts(name, global_context, extractor_context, adapter_context)

Helper function to update the extractor and adapter contexts for the extractor/adapter pair with the given ‘name’

Parameters:
  • name (str) – adapter/extractor name.

  • global_context (dict) – Context of the files, used for every extractor and adapter

  • adapter_context (dict) – Context used for adapters. Key is the name of the adapter, value is the context. The key @all holds context used for every adapter

  • extractor_context (dict) – Context used for extractors. Key is the name of the extractor, value is the context. The key @all holds context used for every extractor

Returns:

Tuple of the extractor context and the adapter context for the named pair

Return type:

(dict, dict)
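One way to read the @all convention is as a layered merge: start from the global context, apply the @all entries, then apply the entries stored under the specific name. The sketch below uses a hypothetical helper, resolve_context, and assumes that name-specific values override @all values, which in turn override global values; the actual precedence used by scythe is not stated here. The extractor names "dft" and "image" are placeholders.

```python
from typing import Optional


def resolve_context(name: str,
                    global_context: Optional[dict],
                    specific_context: Optional[dict]) -> dict:
    """Merge global, '@all', and name-specific context (assumed: later layers win)."""
    specific_context = specific_context or {}
    merged = dict(global_context or {})
    merged.update(specific_context.get("@all", {}))
    merged.update(specific_context.get(name, {}))
    return merged


global_context = {"project": "demo", "units": "eV"}
extractor_context = {"@all": {"units": "Ha"}, "dft": {"quality": "high"}}

print(resolve_context("dft", global_context, extractor_context))
print(resolve_context("image", global_context, extractor_context))
```

The same helper could be applied once with the extractor contexts and once with the adapter contexts to obtain the pair of dictionaries.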

scythe.utils.interface.run_all_extractors_on_directory(directory: str, global_context=None, adapter_context: None | dict = None, extractor_context: None | dict = None, include_extractors: None | List[str] = None, exclude_extractors: None | List = None, adapter_map: None | str | Dict[str, str] = None, default_adapter: None | str = None) → Iterator[ExtractResult]

Run all known extractors on a directory of files

Parameters:
  • directory (str) – Path to directory to be parsed

  • global_context (dict) – Context of the files, used for every extractor and adapter

  • adapter_context (dict) – Context used for adapters. Key is the name of the adapter, value is the context. The key @all holds context used for every adapter

  • extractor_context (dict) – Context used for extractors. Key is the name of the extractor, value is the context. The key @all holds context used for every extractor

  • include_extractors ([str]) – Predefined list of extractors to run. Only these will be used. Mutually exclusive with exclude_extractors.

  • exclude_extractors ([str]) – List of extractors to exclude. Mutually exclusive with include_extractors.

  • adapter_map (str, dict) – Map of extractor name to the desired adapter. Use ‘match’ to find adapters with the same names

  • default_adapter (str) – Adapter to use if no other adapter is defined

Yields:

((str), str, dict) Tuple of (1) group of files, (2) name of extractor, (3) metadata
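The include_extractors and exclude_extractors parameters are described as mutually exclusive. A minimal sketch of that selection rule is shown below; select_extractors is a hypothetical helper written for this example, and raising ValueError when both lists are given is an assumption about how the conflict might be reported.

```python
from typing import List, Optional


def select_extractors(available: List[str],
                      include: Optional[List[str]] = None,
                      exclude: Optional[List[str]] = None) -> List[str]:
    """Pick which extractors to run from a list of available names."""
    if include is not None and exclude is not None:
        raise ValueError("include and exclude are mutually exclusive")
    if include is not None:
        # Only the listed extractors are used
        return [name for name in available if name in include]
    if exclude is not None:
        # Everything except the listed extractors is used
        return [name for name in available if name not in exclude]
    return list(available)


available = ["csv", "dft", "image"]  # placeholder extractor names
print(select_extractors(available, include=["dft"]))
print(select_extractors(available, exclude=["image"]))
```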

scythe.utils.interface.run_all_extractors_on_group(group, adapter_map=None, global_context=None, adapter_context: None | dict = None, extractor_context: None | dict = None, include_extractors: None | List[str] = None, exclude_extractors: None | List = None, default_adapter: None | str = None)

Parse metadata from a file-group and adapt its metadata per a user-supplied adapter_map.

This function is effectively a wrapper to execute_extractor() that enables us to output metadata in the same format as run_all_extractors_on_directory(), but just on a single file group.

Parameters:
  • group ([str]) – Paths to group of files to be parsed

  • global_context (dict) – Context of the files, used for every extractor and adapter

  • adapter_context (dict) – Context used for adapters. Key is the name of the adapter, value is the context. The key @all holds context used for every adapter

  • extractor_context (dict) – Context used for extractors. Key is the name of the extractor, value is the context. The key @all holds context used for every extractor

  • include_extractors ([str]) – Predefined list of extractors to run. Only these will be used. Mutually exclusive with exclude_extractors.

  • exclude_extractors ([str]) – List of extractors to exclude. Mutually exclusive with include_extractors.

  • adapter_map (str, dict) – Map of extractor name to the desired adapter. Use ‘match’ to find adapters with the same names

  • default_adapter (str) – Adapter to use if no other adapter is defined

Yields:

Metadata for a certain
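The interaction between adapter_map and default_adapter can be sketched as a small lookup. The helper resolve_adapter below is hypothetical, as are the adapter names in the usage lines; it assumes that ‘match’ selects an adapter only when one with the extractor's name exists, and that default_adapter is the fallback in every other case. The behavior of scythe itself may differ in the details.

```python
from typing import Dict, Optional, Set, Union


def resolve_adapter(extractor: str,
                    adapter_map: Union[None, str, Dict[str, str]],
                    default_adapter: Optional[str],
                    available_adapters: Set[str]) -> Optional[str]:
    """Return the adapter name to use for an extractor, or None for no adapter."""
    if adapter_map == "match":
        # Use the adapter that shares the extractor's name, if one exists
        if extractor in available_adapters:
            return extractor
    elif isinstance(adapter_map, dict) and extractor in adapter_map:
        # Explicit mapping from extractor name to adapter name
        return adapter_map[extractor]
    return default_adapter


adapters = {"dft", "serialize", "noop"}  # placeholder adapter names
print(resolve_adapter("dft", "match", "noop", adapters))
print(resolve_adapter("image", {"image": "serialize"}, "noop", adapters))
print(resolve_adapter("csv", "match", "noop", adapters))
```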

scythe.utils.interface.run_extractor(name, group, context=None, adapter=None)

Invoke an extractor on a certain group of files

Parameters:
  • name (str) – Name of the extractor

  • group ([str]) – Paths to group of files to be parsed

  • context (dict) – Context of the files, used in adapter and extractor

  • adapter (str) – Name of adapter to use to transform metadata

Returns:

Metadata generated by the extractor

Return type:

([dict])

scythe.utils.grouping

Utilities for implementing grouping operations

scythe.utils.grouping.group_by_postfix(files: Iterable[str], vocabulary: List[str]) → Iterable[Tuple[str, ...]]

Group files that have a common ending

Finds all filenames that begin with a prefix from a user-provided vocabulary and end with the same postfix.

For example, consider a directory that contains files A.1, B.1, A.2, B.2, and C.1. If a user provides a vocabulary of [‘A’, ‘B’], the parser will return groups (A.1, B.1) and (A.2, B.2). If a user provides a vocabulary of [‘A’, ‘B’, ‘C’], the parser will return groups (A.1, B.1), (A.2, B.2), and (C.1)

See scythe.dft.DFTParser for an example usage.

Parameters:
  • files ([str]) – List of files to be grouped

  • vocabulary ([str]) – List of known prefixes with which the filenames start

Yields:

([str]) – Groups of files to be parsed together
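The grouping idea can be illustrated with a short stand-alone sketch. The function group_by_postfix_sketch below is not the scythe implementation: it assumes bare filenames without directories, assigns each file to the first matching prefix, and ignores files whose names match no prefix. It reproduces the first case from the example above, with a vocabulary of ['A', 'B'].

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def group_by_postfix_sketch(files: Iterable[str],
                            vocabulary: List[str]) -> Iterator[Tuple[str, ...]]:
    """Group filenames that start with a known prefix and share the same ending."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in files:
        for prefix in vocabulary:
            if name.startswith(prefix):
                # The postfix is whatever follows the matched prefix
                groups[name[len(prefix):]].append(name)
                break  # a file is assigned to the first matching prefix only
    for postfix in sorted(groups):
        yield tuple(groups[postfix])


files = ["A.1", "B.1", "A.2", "B.2", "C.1"]
print(list(group_by_postfix_sketch(files, ["A", "B"])))
# [('A.1', 'B.1'), ('A.2', 'B.2')]
```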

scythe.utils.grouping.preprocess_paths(paths: str | Path | List[str] | List[Path]) → List[str]

Transform paths to absolute paths

Designed to be used to simplify grouping logic

Parameters:

paths (Union[str, Path, List[str], List[Path]]) – Files and directories to be parsed

Returns:

List of paths in standardized form

Return type:

(List[str])
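A minimal sketch of this kind of normalization is given below. The helper to_absolute_paths is hypothetical and only approximates the documented behavior: it accepts a single path or a list of paths and returns absolute path strings. Whether scythe also expands ~ or resolves symbolic links is not stated here, so the expanduser call is an assumption.

```python
import os
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def to_absolute_paths(paths: Union[PathLike, List[PathLike]]) -> List[str]:
    """Return the given path or paths as a list of absolute path strings."""
    if isinstance(paths, (str, Path)):
        paths = [paths]  # wrap a single path so both cases share one code path
    return [os.path.abspath(os.path.expanduser(str(p))) for p in paths]


print(to_absolute_paths("data"))
print(to_absolute_paths([Path("data") / "run1", "data/run2"]))
```

Returning plain strings in every case means downstream grouping code only has to handle one input type.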