Contributor Guide

Setting up development environment

Scythe uses the Poetry project to manage dependencies and packaging. To install the latest version of Scythe, first install Poetry by following its documentation. Then clone or download the Scythe repository from GitHub, change into that directory, and run poetry install. It is a good idea to create a new virtual environment for the project first, so that its dependencies do not mix with your system environment.

By default, only a small subset of extractors will be installed (this is done so that you do not need to install all the dependencies of extractors you may never use). To install additional extractors, you can specify “extras” at install time using poetry. Any of the values specified in the [tool.poetry.extras] section of pyproject.toml can be provided, including all, which will install all bundled extractors and their dependencies. For example:

poetry install -E all

Poetry will create a dedicated virtual environment for the project and install the Scythe code in “editable” mode, so any changes you make to the code will be reflected when running tests, importing extractors, etc. It will use the default version of Python available. Scythe is currently developed and tested against Python versions 3.8.12, 3.9.12, and 3.10.4. If these do not match your system version of Python, we recommend using the pyenv project to manage multiple Python versions on your system. These versions are also needed to run tox (see the next paragraph). Make sure you install the versions specified in the .python-version file by running commands such as pyenv install 3.8.12.

Additionally, the project uses tox to simplify common tasks and to run tests in isolated environments. It is installed automatically as a development package when running the poetry install command above. It can be used to run the test suite with common settings and to build the documentation. For example, to run the full Scythe test suite on all three targeted versions of Python, run:

poetry run tox

To build the HTML documentation (will be placed inside the ./docs/_build/ folder), run:

poetry run tox -e docs

For the sake of speed, if you would like to focus your testing on just one Python version, you can temporarily override the environment list from pyproject.toml with an environment variable. For example, to run the test/coverage suite only on Python 3.8.X, run:

TOXENV=py38 poetry run tox

Check out the [tool.tox] section of the pyproject.toml file to view how these tasks are configured, and the tox documentation on how to add your own custom tasks, if needed.

Finally, Scythe uses flake8 to enforce code styles, which will be run for you automatically when using tox as defined above. Any code-style errors, such as lines longer than 100 characters, trailing whitespace, etc. will be flagged when running poetry run tox.

The next part of the Scythe guide details how to add a new extractor to the ecosystem.

Step 1: Implement the Extractor

Creating a new extractor is accomplished by implementing the BaseExtractor abstract class. If you are new to MaterialsIO, we recommend reviewing the User Guide first to learn about the available methods of BaseExtractor. At a minimum, you need only implement the extract, version, and implementors operations for a new extractor. Each of these methods (and any other methods you override) must be stateless, so that running the operation does not change the behavior of the extractor.
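
As a rough illustration of the minimum described above, the sketch below implements only extract, version, and implementors. The BaseExtractor defined here is a simplified stand-in so that the example runs without Scythe installed; the real class and its method signatures may differ, and LineCountExtractor is a hypothetical name.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class BaseExtractor(ABC):
    """Stand-in for Scythe's BaseExtractor (assumed, simplified interface)."""

    @abstractmethod
    def extract(self, group: List[str], context: Optional[dict] = None) -> Dict:
        ...

    @abstractmethod
    def implementors(self) -> List[str]:
        ...

    @abstractmethod
    def version(self) -> str:
        ...


class LineCountExtractor(BaseExtractor):
    """Count the lines in each text file of a group (hypothetical example)."""

    def extract(self, group, context=None):
        # Stateless: the result depends only on the arguments,
        # and nothing is stored on self.
        counts = {}
        for path in group:
            with open(path, encoding="utf-8") as fp:
                counts[path] = sum(1 for _ in fp)
        return {"line_counts": counts}

    def implementors(self):
        return ["FirstName LastName <email@provider.com>"]

    def version(self):
        return "0.1.0"
```

Because no method writes to self, calling extract twice on the same group gives the same summary.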

We also have subclasses of BaseExtractor that are useful for common types of extractors:

  • BaseSingleFileExtractor: Extractors that only ever evaluate a single file at a time

Class Attributes and Initializer

The BaseExtractor class supports configuration options as Python class attributes. These options are intended to define the behavior of an extractor for a particular environment (e.g., paths of required executables) or for a particular application (e.g., turning off unneeded features). We recommend limiting these options to be only JSON-serializable data types and for all to be defined in the __init__ function to simplify text-based configuration files.

The initializer function should check whether an extractor has access to all required external tools, and throw exceptions if not. For example, an extractor that relies on calling an external command-line tool should check whether that tool is installed. In general, extractors should fail during initialization and not during the parsing operation if the system is misconfigured.
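
One way to fail early is to look up the external tool on the PATH inside __init__. The sketch below uses the standard-library shutil.which for that check; the class name and the default executable name "some-tool" are hypothetical, and the class does not inherit from BaseExtractor so that it runs on its own.

```python
import shutil


class ExternalToolExtractor:
    """Hypothetical extractor that shells out to a command-line tool."""

    def __init__(self, executable: str = "some-tool"):
        # Keep configuration JSON-serializable (here, a plain string).
        self.executable = executable
        # Fail at construction time rather than later inside extract().
        if shutil.which(executable) is None:
            raise RuntimeError(
                f"Required executable not found on PATH: {executable!r}"
            )
```

With this pattern, a misconfigured system is reported as soon as the extractor is created, before any files are parsed.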

Implementing extract

The extract method contains the core logic of a Scythe extractor: rendering a summary of a group of data files. We do not specify any particular schema for the output but we do recommend best practices:

  1. Summaries must be JSON-serializable.

    Limiting summaries to JSON data types ensures they are readable by most software without special libraries. JSON documents are also easy to document.

  2. Human-readability is desirable.

    JSON summaries should be understandable to users without expert-level knowledge of the data. Avoid unfamiliar acronyms, such as names of variables in a specific simulation code or settings specific to a certain brand of instrument.

  3. Adhere closely to the original format.

    If feasible, try to stay close to the original data format of a file or the output of a library used for parsing. Deviating from already existing formats complicates modifications to an extractor.

  4. Always return a dictionary.

    If an extractor can return multiple records from a single file group, return the list as an element of the dictionary. Any metadata that pertains to each of the sub-records should be stored as a distinct element rather than being duplicated in each sub-record.
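
The practices above can be sketched as follows. The key names ("instrument", "records") and the summarize helper are hypothetical and are not a schema prescribed by Scythe; the point is only that the return value is a dictionary, shared metadata is stored once, and the result survives json.dumps.

```python
import json


def summarize(values, instrument):
    """Build one summary dictionary for a file group (hypothetical layout)."""
    return {
        # Metadata shared by every sub-record is stored once.
        "instrument": instrument,
        # Multiple records are returned as a list inside the dictionary.
        "records": [{"index": i, "value": v} for i, v in enumerate(values)],
    }


summary = summarize([1.5, 2.5], instrument="example-detector")
# json.dumps raises TypeError if any value is not JSON-serializable.
encoded = json.dumps(summary)
```

Running json.dumps on a summary in a unit test is a cheap way to catch non-serializable values such as NumPy arrays or datetime objects.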

We also have recommendations for the extractor behavior:

  1. Avoid configuration options that change only output format.

    Extractors can take configuration options that alter the output format, but configurations should be used sparingly. A good use of configuration would be to disable complex parsing operations if unneeded. A bad use of configuration would be to change the output to match a different schema. Operations that significantly alter the form but not the content of a summary should be implemented as adaptors.

  2. Consider whether context should be configuration.

    Settings that are identical for each file could be better suited as configuration settings than as context.

Implementing group

The group operation finds all sets of files in a user-provided list of files and directories that should be parsed together. Implementing group is optional: a new group method is required only when the default behavior of “each file is its own group” (i.e., the extractor only treats files individually) is incorrect.

The group operation should not require access to the content of the files or directories to determine groupings. Being able to determine file groups via only file names improves performance and allows for determining groups of parsable files without needing to download them from remote systems.

Files are allowed to appear in more than one group, but we recommend generating only the largest valid group of files to minimize the same metadata being generated multiple times.

It is important to note that file groups are specific to an extractor. Groupings of files that are meaningful to one extractor need not be meaningful to another. For that reason, limit the definition of groups to sets of files that can be parsed together, without considering what other information makes the files related (e.g., being in the same directory).

Another appropriate use of the group operation is to filter out files which are very unlikely to parse correctly. For example, a PDF extractor could identify only files with a “.pdf” extension. However, we recommend using filtering sparingly to ensure no files are missed.
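
As a sketch of these guidelines, the function below groups files using their names alone and never opens them. It is written as a standalone function, not as the actual group method of BaseExtractor (whose signature may differ), and the ".dat"/".hdr" pairing is a hypothetical file convention chosen for illustration.

```python
import os
from collections import defaultdict
from typing import Iterable, Iterator, Tuple


def group_by_stem(files: Iterable[str]) -> Iterator[Tuple[str, ...]]:
    """Group files that share a path stem, e.g. run1.dat with run1.hdr."""
    groups = defaultdict(list)
    for path in files:
        stem, ext = os.path.splitext(path)
        # Light filtering: keep only extensions this sketch knows how to pair.
        if ext.lower() in {".dat", ".hdr"}:
            groups[stem].append(path)
    # Yield one (largest) group per stem, in a deterministic order.
    for stem in sorted(groups):
        yield tuple(sorted(groups[stem]))
```

For example, given "a/run1.dat", "a/run1.hdr", "a/notes.txt", and "b/run2.dat", this yields one group containing both run1 files and one group containing only "b/run2.dat"; the text file is filtered out.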

Implementing citations and implementors

The citations and implementors methods identify additional resources describing an extractor and provide credit to contributors. implementors is required, as this operation is also used to identify points-of-contact for support requests.

citations should return a list of BibTeX-format references.

implementors should return a list of people and, optionally, their contact information in the form: “FirstName LastName <email@provider.com>”.
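
A minimal sketch of the two methods is shown below. The class is hypothetical and standalone, the method name citations follows the section heading, and the BibTeX entry is a placeholder, not a real reference.

```python
class CitedExtractor:
    """Hypothetical extractor showing only the credit-related methods."""

    def citations(self):
        # One BibTeX entry per list element (placeholder, not a real work).
        return [
            "@misc{placeholder,\n"
            "  author = {LastName, FirstName},\n"
            "  title = {Placeholder Title},\n"
            "  year = {2020}\n"
            "}"
        ]

    def implementors(self):
        # Name followed by optional contact information in angle brackets.
        return ["FirstName LastName <email@provider.com>"]
```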

Implementing version

We require using semantic versioning for specifying the version of extractors. As the API of the extractor should remain unchanged, use versioning to indicate changes in available options or the output schema. The version operation should return the version of the extractor.

Step 2: Document the Extractor

The docstring for an extractor must start with a short, one sentence summary of the extractor, which will be used by our autodocumentation tooling. The rest of the documentation should describe what types of files are compatible, what context information can be used, and summarize what types of metadata are generated.

Todo

Actually write these descriptors for the available extractors

The Scythe project uses JSON documents as the output for all extractors and JSON Schema to describe the content of the documents. The BaseExtractor class includes a property, schema, that stores a description of the output format. We recommend writing your description as a separate file and having the schema property read and output the contents of this file. See the GenericFileExtractor source code for an example.
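
The recommendation to keep the schema in a separate file could look roughly like the sketch below. It is not the GenericFileExtractor implementation: the class name, the schema_path argument, and the assumption that schema returns a parsed dictionary are all illustrative, and the schema file is written to a temporary directory only to keep the example self-contained.

```python
import json
import os
import tempfile


class SchemaFromFileExtractor:
    """Hypothetical extractor whose schema lives in a separate JSON file."""

    def __init__(self, schema_path: str):
        self.schema_path = schema_path

    @property
    def schema(self) -> dict:
        # Read on access so the file remains the single source of truth.
        with open(self.schema_path, encoding="utf-8") as fp:
            return json.load(fp)


# Demonstration with a throwaway schema file.
example_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {"line_counts": {"type": "object"}},
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "schema.json")
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(example_schema, fp)
    loaded = SchemaFromFileExtractor(path).schema
```

In a real package the schema file would normally ship alongside the module, and its path would be resolved relative to the module's location.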

Step 3: Register the Extractor

Preferred Route: Adding the Extractor to Scythe

If your extractor has the same dependencies as existing extractors, add it to the existing module with the same dependencies.

If your extractor has new dependencies, create a new module for your extractor in scythe, and then add the requirements as a new key in the [tool.poetry.extras] section of pyproject.toml, following the other extractor examples in that section. Next, add your extractor to docs/source/extractors.rst by adding an .. automodule:: statement that refers to your new module (again, following the existing pattern).

Scythe uses stevedore to simplify access to the extractors. After implementing and documenting the extractor, add it to the [tool.poetry.plugins."scythe.extractor"] section of the pyproject.toml file for Scythe. See the stevedore documentation for more information (these docs reference setup.py, but the equivalent can be done via plugins in pyproject.toml; follow the existing structure if you’re unsure, and ask the developers for help if you run into issues).

Alternative Route: Including Extractors from Other Libraries

If an extractor would be better suited as part of a different library, you can still register it as an extractor with Scythe by altering your pyproject.toml file. Add an entry point with the namespace "scythe.extractor" and point to the class object following the stevedore documentation. Adding the entry point will let Scythe use your extractor if your library is installed in the same Python environment as Scythe.
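
To check whether an entry point was registered, you can list what is installed under the "scythe.extractor" namespace. The sketch below uses only the standard-library importlib.metadata and does not use stevedore's own API; the helper name list_extractor_entry_points is hypothetical.

```python
import sys
from importlib import metadata


def list_extractor_entry_points(namespace: str = "scythe.extractor"):
    """Return sorted (name, target) pairs registered under a namespace."""
    eps = metadata.entry_points()
    if sys.version_info >= (3, 10):
        # Python 3.10+ exposes a select() method for filtering by group.
        selected = eps.select(group=namespace)
    else:
        # Python 3.8/3.9 return a dict keyed by group name.
        selected = eps.get(namespace, [])
    return sorted((ep.name, ep.value) for ep in selected)
```

If your library is installed in the same environment, its extractor name should appear in the returned list; an empty list means no entry points are registered under that namespace.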

Todo

Provide a public listing of scythe-compatible software.

So that people know where to find these external libraries