scythe¶
Documentation for the non-parser functions in scythe.
scythe.adapters.base¶
Base classes for adapters
- class scythe.adapters.base.BaseAdapter[source]¶
Template for tools that transform metadata into a new form
- check_compatibility(parser: BaseExtractor) → bool [source]¶
Evaluate whether an adapter is compatible with a certain parser
- Parameters:
parser (BaseExtractor) – Parser to evaluate
- Returns:
(bool) Whether this parser is compatible
- class scythe.adapters.base.GreedySerializeAdapter[source]¶
Converts the metadata to a string by serializing with JSON, making some (hopefully) informed choices about what to do with various types commonly seen, and otherwise reporting that the data type could not be serialized. May not work in all situations, but should cover a large number of cases.
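The "greedy" approach described above can be illustrated with the standard library alone. The following is a minimal sketch of the idea, not the code used by GreedySerializeAdapter: the function names, the set of handled types, and the placeholder string for unserializable values are all assumptions made for illustration.

```python
import json
from datetime import date, datetime
from pathlib import Path


def greedy_default(obj):
    """Fallback called by json.dumps for objects it cannot serialize natively.

    Hypothetical helper: the handled types below are illustrative choices.
    """
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    if isinstance(obj, (set, frozenset)):
        # Sort by repr so the output is deterministic
        return sorted(obj, key=repr)
    if isinstance(obj, bytes):
        return obj.decode("utf-8", errors="replace")
    if isinstance(obj, Path):
        return str(obj)
    # Report the failure instead of raising (placeholder text is an assumption)
    return f"<<Unserializable type: {type(obj).__name__}>>"


def greedy_serialize(metadata) -> str:
    """Serialize metadata to a JSON string, never failing on unknown types."""
    return json.dumps(metadata, default=greedy_default, sort_keys=True)


example = {"when": date(2020, 1, 2), "tags": {"b", "a"}, "obj": object()}
serialized = greedy_serialize(example)
print(serialized)
```

Routing every unknown type through a single `default` hook keeps the serializer from raising, at the cost of losing information for types it does not recognize.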
- class scythe.adapters.base.NOOPAdapter[source]¶
Adapter that does not alter the output data
Used for testing purposes
scythe.utils.interface¶
Utilities for working with extractors from other applications
- class scythe.utils.interface.ExtractResult(group, extractor, metadata)¶
Create new instance of ExtractResult(group, extractor, metadata)
- extractor¶
Alias for field number 1
- group¶
Alias for field number 0
- metadata¶
Alias for field number 2
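The field numbering above matches how a named tuple behaves. As a small sketch, the stand-in below is built with `collections.namedtuple` using the same field order; it is not imported from scythe, and the file names, extractor name, and metadata values are made up for illustration.

```python
from collections import namedtuple

# Stand-in with the same field order as the documented ExtractResult
ExtractResult = namedtuple("ExtractResult", ["group", "extractor", "metadata"])

# Hypothetical values, for illustration only
result = ExtractResult(
    group=("A.1", "B.1"),
    extractor="example",
    metadata={"key": "value"},
)

# Fields can be read by name, by position, or by unpacking
group, extractor, metadata = result
print(result.group, result[1], metadata)
```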
- scythe.utils.interface.get_adapter(name: str) → BaseAdapter [source]¶
Load an adapter
- Parameters:
name (str) – Name of adapter
- Returns:
(BaseAdapter) Requested adapter
- scythe.utils.interface.get_available_adapters() → dict [source]¶
Get information on all available adapters
- Returns:
(dict) Where keys are adapter names and values are descriptions
- scythe.utils.interface.get_available_extractors()[source]¶
Get information about the available extractors
- Returns:
Descriptions of available extractors
- Return type:
[dict]
- scythe.utils.interface.get_extractor(name: str) → BaseExtractor [source]¶
Load an extractor object
- Parameters:
name (str) – Name of extractor
- Returns:
Requested extractor
- scythe.utils.interface.get_extractor_and_adapter_contexts(name, global_context, extractor_context, adapter_context)[source]¶
Helper function to update the extractor and adapter contexts for the extractor/adapter pair identified by ‘name’
- Parameters:
name (str) – adapter/extractor name.
global_context (dict) – Context of the files, used for every extractor and adapter
adapter_context (dict) – Context used for adapters. Key is the name of the adapter, value is the context. The key @all is used for context that applies to every adapter
extractor_context (dict) – Context used for extractors. Key is the name of the extractor, value is the context. The key @all is used for context that applies to every extractor
- Returns:
Tuple of the extractor context and the adapter context
- Return type:
tuple
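One way to picture how the global context, the @all entry, and the name-specific entry could combine is sketched below. This is a hypothetical illustration, not the library's implementation: the function name `merge_contexts` and the precedence order (name-specific over @all over global) are assumptions.

```python
def merge_contexts(name, global_context, extractor_context, adapter_context):
    """Build per-name contexts for one extractor/adapter pair.

    Assumed precedence (not confirmed by the documentation):
    name-specific entries override '@all', which overrides the global context.
    """

    def resolve(specific):
        merged = dict(global_context or {})
        merged.update((specific or {}).get("@all", {}))
        merged.update((specific or {}).get(name, {}))
        return merged

    return resolve(extractor_context), resolve(adapter_context)


# Hypothetical contexts for an extractor/adapter pair named "dft"
my_extractor_context, my_adapter_context = merge_contexts(
    "dft",
    global_context={"lab": "x"},
    extractor_context={"@all": {"quality": "low"}, "dft": {"quality": "high"}},
    adapter_context={"@all": {"fmt": "json"}},
)
print(my_extractor_context, my_adapter_context)
```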
- scythe.utils.interface.run_all_extractors_on_directory(directory: str, global_context=None, adapter_context: None | dict = None, extractor_context: None | dict = None, include_extractors: None | List[str] = None, exclude_extractors: None | List = None, adapter_map: None | str | Dict[str, str] = None, default_adapter: None | str = None) → Iterator[ExtractResult] [source]¶
Run all known extractors on a directory of files
- Parameters:
directory (str) – Path to directory to be parsed
global_context (dict) – Context of the files, used for every extractor and adapter
adapter_context (dict) – Context used for adapters. Key is the name of the adapter, value is the context. The key @all is used for context that applies to every adapter
extractor_context (dict) – Context used for extractors. Key is the name of the extractor, value is the context. The key @all is used for context that applies to every extractor
include_extractors ([str]) – Predefined list of extractors to run. Only these will be used. Mutually exclusive with exclude_extractors.
exclude_extractors ([str]) – List of extractors to exclude. Mutually exclusive with include_extractors.
adapter_map (str, dict) – Map of extractor name to the desired adapter. Use ‘match’ to find adapters with the same names
default_adapter (str) – Adapter to use if no other adapter is defined
- Yields:
((str), str, dict) Tuple of (1) group of files, (2) name of extractor, (3) metadata
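The interplay of adapter_map and default_adapter described in the parameters above might be resolved along the following lines. This is a sketch under stated assumptions, not scythe's code: the helper name `resolve_adapter_name`, the `available_adapters` argument, the fallback behavior, and the extractor and adapter names in the example are all hypothetical.

```python
from typing import Dict, Optional, Set, Union


def resolve_adapter_name(
    extractor_name: str,
    adapter_map: Union[None, str, Dict[str, str]],
    default_adapter: Optional[str],
    available_adapters: Set[str],
) -> Optional[str]:
    """Pick the adapter name to pair with an extractor (hypothetical helper)."""
    if adapter_map == "match":
        # 'match' means: use the adapter that shares the extractor's name
        if extractor_name in available_adapters:
            return extractor_name
    elif isinstance(adapter_map, dict):
        if extractor_name in adapter_map:
            return adapter_map[extractor_name]
    # Assumed fallback when nothing else is defined for this extractor
    return default_adapter


# Hypothetical adapter names
adapters = {"dft", "noop"}
print(resolve_adapter_name("dft", "match", "noop", adapters))
print(resolve_adapter_name("image", "match", "noop", adapters))
print(resolve_adapter_name("image", {"image": "dft"}, "noop", adapters))
```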
- scythe.utils.interface.run_all_extractors_on_group(group, adapter_map=None, global_context=None, adapter_context: None | dict = None, extractor_context: None | dict = None, include_extractors: None | List[str] = None, exclude_extractors: None | List = None, default_adapter: None | str = None)[source]¶
Parse metadata from a file-group and adapt its metadata per a user-supplied adapter_map.
This function is effectively a wrapper to execute_extractor() that enables us to output metadata in the same format as run_all_extractors_on_directory(), but just on a single file group.
- Parameters:
group ([str]) – Paths to group of files to be parsed
global_context (dict) – Context of the files, used for every extractor and adapter
adapter_context (dict) – Context used for adapters. Key is the name of the adapter, value is the context. The key @all is used for context that applies to every adapter
extractor_context (dict) – Context used for extractors. Key is the name of the extractor, value is the context. The key @all is used for context that applies to every extractor
include_extractors ([str]) – Predefined list of extractors to run. Only these will be used. Mutually exclusive with exclude_extractors.
exclude_extractors ([str]) – List of extractors to exclude. Mutually exclusive with include_extractors.
adapter_map (str, dict) – Map of extractor name to the desired adapter. Use ‘match’ to find adapters with the same names
default_adapter (str) – Adapter to use if no other adapter is defined
- Yields:
Metadata for a certain
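The mutually exclusive include_extractors and exclude_extractors parameters listed above could be applied as in the sketch below. The helper `select_extractors`, the choice to raise `ValueError` when both are given, and the extractor names are assumptions for illustration; the documentation does not say how scythe reports the conflict.

```python
from typing import List, Optional


def select_extractors(
    available: List[str],
    include_extractors: Optional[List[str]] = None,
    exclude_extractors: Optional[List[str]] = None,
) -> List[str]:
    """Choose which extractors to run (hypothetical helper)."""
    if include_extractors is not None and exclude_extractors is not None:
        # Assumed error handling for the mutually exclusive options
        raise ValueError(
            "include_extractors and exclude_extractors are mutually exclusive"
        )
    if include_extractors is not None:
        return [name for name in available if name in include_extractors]
    if exclude_extractors is not None:
        return [name for name in available if name not in exclude_extractors]
    return list(available)


# Hypothetical extractor names
available = ["dft", "image", "csv"]
print(select_extractors(available, include_extractors=["csv", "dft"]))
print(select_extractors(available, exclude_extractors=["image"]))
```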
scythe.utils.grouping¶
Utilities for implementing grouping operations
- scythe.utils.grouping.group_by_postfix(files: Iterable[str], vocabulary: List[str]) → Iterable[Tuple[str, ...]] [source]¶
Group files that have a common ending
Finds all filenames that begin with a prefix from a user-provided vocabulary and end with the same postfix.
For example, consider a directory that contains files A.1, B.1, A.2, B.2, and C.1. If a user provides a vocabulary of [‘A’, ‘B’], the parser will return groups (A.1, B.1) and (A.2, B.2). If a user provides a vocabulary of [‘A’, ‘B’, ‘C’], the parser will return groups (A.1, B.1), (A.2, B.2), and (C.1)
See scythe.dft.DFTParser for an example usage.
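The grouping rule can be sketched in a few lines of standard-library Python. The function below is an illustrative reimplementation written from the description above, not the source of group_by_postfix; the names `postfix_of` and `group_by_postfix_sketch` are hypothetical, and it returns a list rather than a lazy iterable for easier inspection. It reproduces the first vocabulary example, [‘A’, ‘B’].

```python
import os
from itertools import groupby
from typing import Iterable, List, Optional, Tuple


def postfix_of(path: str, vocabulary: List[str]) -> Optional[str]:
    """Return what follows the first matching vocabulary prefix, or None."""
    filename = os.path.basename(path)
    for prefix in vocabulary:
        if filename.startswith(prefix):
            return filename[len(prefix):]
    return None


def group_by_postfix_sketch(
    files: Iterable[str], vocabulary: List[str]
) -> List[Tuple[str, ...]]:
    """Group files that start with a vocabulary prefix by their shared postfix."""
    # Keep only files whose name starts with a vocabulary entry
    matched = [f for f in files if postfix_of(f, vocabulary) is not None]
    # groupby only merges adjacent items, so sort by postfix (then path) first
    matched.sort(key=lambda f: (postfix_of(f, vocabulary), f))
    return [
        tuple(group)
        for _, group in groupby(matched, key=lambda f: postfix_of(f, vocabulary))
    ]


files = ["A.1", "B.1", "A.2", "B.2", "C.1"]
groups = group_by_postfix_sketch(files, ["A", "B"])
print(groups)  # [('A.1', 'B.1'), ('A.2', 'B.2')]
```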