Contributor Guide
Setting up development environment
Scythe uses the Poetry project to manage dependencies and packaging. To install the latest
version of Scythe, first install Poetry following its documentation. Once that’s done, clone or
download the Scythe repository from GitHub, change into that directory, and run poetry install
(it is a good idea to create a new virtual environment for your project first, so as not to mix
dependencies with your system environment).
By default, only a small subset of extractors will be installed (this is done so that you do not
need to install all the dependencies of extractors you may never use). To install additional
extractors, you can specify “extras” at install time using poetry. Any of the values specified
in the [tool.poetry.extras] section of pyproject.toml can be provided, including all, which
will install all bundled extractors and their dependencies. For example:
poetry install -E all
Poetry will create a dedicated virtual environment for the project and the Scythe code will
be installed in “editable” mode, so any changes you make to the code will be reflected when
running tests, importing extractors, etc. It will use the default version of Python available.
Scythe is currently developed and tested against Python versions 3.8.12, 3.9.12, and 3.10.4.
We recommend using the pyenv project to manage various Python versions on your system if these
do not match your system version of Python. pyenv is also required in order to use tox (see next
paragraph). Make sure you install the versions specified in the .python-version file by running
commands such as pyenv install 3.8.12 for each of them.
Additionally, the project uses tox to simplify common tasks and to run tests in isolated
environments. tox will be installed automatically as a development package when running the
poetry install command above. It can be used to run the test suite with common settings, as
well as to build the documentation. For example, to run the full Scythe test suite on all three
targeted versions of Python, just run:
poetry run tox
To build the HTML documentation (which will be placed inside the ./docs/_build/ folder), run:
poetry run tox -e docs
For the sake of speed, if you would like to focus your testing on just one Python version, you can
temporarily override the environment list from pyproject.toml with an environment variable.
For example, to only run the test/coverage suite on Python 3.8.X, run:
TOXENV=py38 poetry run tox
Check out the [tool.tox] section of the pyproject.toml file to view how these tasks are
configured, and the tox documentation on how to add your own custom tasks, if needed.
Finally, Scythe uses flake8 to enforce code style, which will be run for you automatically
when using tox as defined above. Any code-style errors, such as lines longer than 100
characters, trailing whitespace, etc. will be flagged when running poetry run tox.
The next part of the Scythe guide details how to add a new extractor to the ecosystem.
Step 1: Implement the Extractor
Creating a new extractor is accomplished by implementing the
BaseExtractor abstract class. If you are new to MaterialsIO, we
recommend reviewing the User Guide first to learn about
the available methods of BaseExtractor. Minimally, you need only implement the extract,
version, and implementors operations for a new extractor. Each of these methods (and any
other methods you override) must be stateless, so that running the operation does not change the
behavior of the extractor.
We also have subclasses of BaseExtractor that are useful for common types of extractors:
- BaseSingleFileExtractor: Extractors that only ever evaluate a single file at a time
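As a rough illustration of the three required operations, the sketch below defines a toy extractor. To stay self-contained it declares a simplified stand-in for BaseExtractor instead of importing Scythe's class. The stand-in, the FileSizeExtractor name, and the extract(group, context) signature are assumptions made for illustration and may differ from the real API.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Iterable, List, Optional


class BaseExtractor(ABC):
    """Simplified stand-in for Scythe's BaseExtractor (an assumption, not the real class)."""

    @abstractmethod
    def extract(self, group: Iterable[str], context: Optional[dict] = None) -> dict:
        ...

    @abstractmethod
    def version(self) -> str:
        ...

    @abstractmethod
    def implementors(self) -> List[str]:
        ...


class FileSizeExtractor(BaseExtractor):
    """Hypothetical extractor that reports the size in bytes of each file in a group."""

    def extract(self, group, context=None):
        # Reads only from its arguments and stores nothing on self, so it is stateless
        return {"files": [{"path": p, "size_bytes": os.path.getsize(p)} for p in group]}

    def version(self):
        return "0.1.0"

    def implementors(self):
        return ["FirstName LastName <email@provider.com>"]


# Usage: summarize a small temporary file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w") as fp:
        fp.write("hello")
    summary = FileSizeExtractor().extract([path])
```

Because none of the methods modify the instance, calling extract repeatedly on the same group gives the same summary.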
Class Attributes and Initializer
The BaseExtractor class supports configuration options as Python class attributes.
These options are intended to define the behavior of an extractor for a particular environment
(e.g., paths of required executables) or for a particular application (e.g., turning off unneeded
features). We recommend limiting these options to JSON-serializable data types and defining
all of them in the __init__ function to simplify text-based configuration files.
The initializer function should check whether an extractor has access to all required external tools, and throw an exception if not. For example, an extractor that relies on calling an external command-line tool should check whether that tool is installed. In general, extractors should fail during initialization and not during the parsing operation if the system is misconfigured.
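The fail-early behavior described above could be sketched as follows. The class name and the executable parameter are hypothetical, and shutil.which from the standard library is just one possible way to test for a command-line tool.

```python
import shutil


class ExternalToolExtractor:
    """Hypothetical extractor that shells out to a command-line tool."""

    def __init__(self, executable: str):
        # Resolve the tool once, at construction time, so that a misconfigured
        # system is reported before any parsing is attempted
        resolved = shutil.which(executable)
        if resolved is None:
            raise RuntimeError(f"Required executable not found: {executable}")
        # Keep the configuration value itself a plain string (JSON-serializable)
        self.executable = executable
        self.resolved_path = resolved
```

With this pattern, a missing tool is reported when the extractor is constructed, not partway through parsing a batch of files.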
Implementing extract
The extract method contains the core logic of a Scythe extractor: rendering a summary of a
group of data files. We do not specify any particular schema for the output but we do recommend
best practices:
- Summaries must be JSON-serializable.
Limiting to JSON data types ensures summaries are readable by most software without special libraries. JSON documents are also easy to document.
- Human-readability is desirable.
JSON summaries should be understandable to users without expert-level knowledge of the data. Avoid unfamiliar acronyms, such as names of variables in a specific simulation code or settings specific to a certain brand of instrument.
- Adhere closely to the original format.
If feasible, try to stay close to the original data format of a file or the output of a library used for parsing. Deviating from already existing formats complicates modifications to an extractor.
- Always return a dictionary.
If an extractor can return multiple records from a single file group, return the list as an element of the dictionary. Any metadata that pertains to each of the sub-records should be stored as a distinct element rather than being duplicated in each sub-record.
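To make these practices concrete, here is a minimal sketch of a summary for a file group that yields several records. All field names and values are invented for illustration and are not part of any Scythe schema.

```python
import json

# Hypothetical summary for a file group that yields several records
summary = {
    # Metadata shared by every record is stored once, at the top level
    "instrument": "example microscope",
    "acquisition_date": "2020-01-01",
    # The individual records are returned as a list inside the dictionary
    "records": [
        {"frame": 0, "exposure_seconds": 0.5},
        {"frame": 1, "exposure_seconds": 1.0},
    ],
}

# A round trip through the json module confirms the summary is JSON-serializable
round_tripped = json.loads(json.dumps(summary))
```

The top-level object is a dictionary, the key names are written out in full, and the shared metadata is not repeated inside each record.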
We also have recommendations for extractor behavior:
- Avoid configuration options that change only output format.
Extractors can take configuration options that alter the output format, but configurations should be used sparingly. A good use of configuration would be to disable complex parsing operations if unneeded. A bad use of configuration would be to change the output to match a different schema. Operations that significantly alter the form but not the content of a summary should be implemented as adaptors.
- Consider whether context should be configuration.
Settings that are identical for each file could be better suited as configuration settings than as context.
Implementing group
The group operation finds all sets of files in a user-provided list of files and directories
that should be parsed together. Implementing a new group method is optional; it is required only
when the default behavior of “each file is its own group” (i.e., the extractor only treats files
individually) is incorrect.
The group operation should not require access to the content of the files or directories to
determine groupings. Being able to determine file groups via only file names improves performance
and allows for determining groups of parsable files without needing to download them from remote
systems.
Files are allowed to appear in more than one group, but we recommend generating only the largest valid group of files to minimize the same metadata being generated multiple times.
It is important to note that file groups are specific to an extractor. Groupings of files that are meaningful to one extractor need not be meaningful to another. For that reason, limit the definition of groups to sets of files that can be parsed together without consideration of what other information makes the files related (e.g., being in the same directory).
Another appropriate use of the group operation is to filter out files which are very unlikely
to parse correctly. For example, a PDF extractor could identify only files with a “.pdf” extension.
However, we recommend using filtering sparingly to ensure no files are missed.
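A name-only grouping rule of the kind described in this section might look like the sketch below. The group_by_stem function and its signature are hypothetical and do not reproduce the signature of BaseExtractor's group method. It groups files that share a directory and base name, such as an input file and its matching output file.

```python
import os
from collections import defaultdict
from typing import Iterable, Iterator, Tuple


def group_by_stem(files: Iterable[str]) -> Iterator[Tuple[str, ...]]:
    """Yield groups of files that share a directory and a base name.

    Only the file names are inspected; no file is opened.
    """
    groups = defaultdict(list)
    for path in files:
        # "data/run1.in" and "data/run1.out" both have the stem "data/run1"
        stem, _extension = os.path.splitext(path)
        groups[stem].append(path)
    for stem in sorted(groups):
        # Each file lands in exactly one group, the largest one it belongs to
        yield tuple(sorted(groups[stem]))


groups = list(group_by_stem(["data/run1.in", "data/run1.out", "data/run2.in"]))
```

Since the function never opens a file, it can also be applied to a listing of remote files before anything is downloaded.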
Implementing citations and implementors
The citations and implementors methods identify additional resources describing an extractor
and provide credit to contributors. implementors is required, as this operation is also used
to identify points-of-contact for support requests.
citations should return a list of BibTeX-format references.
implementors should return a list of people and, optionally, their contact information
in the form: “FirstName LastName <email@provider.com>”.
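One practical benefit of the “Name <email>” form is that standard tools can split it. The sketch below uses Python's email.utils.parseaddr on the placeholder entry from the text. Whether Scythe itself parses these strings this way is not stated here, so treat this only as an illustration of the format.

```python
from email.utils import parseaddr

# Entry in the recommended form (placeholder contact details from the text)
implementor = "FirstName LastName <email@provider.com>"

# parseaddr splits a "Name <address>" string into its two parts
name, address = parseaddr(implementor)
```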
Implementing version
We require using semantic versioning for specifying the version of extractors.
As the API of the extractor should remain unchanged, use versioning to indicate changes in available
options or the output schema. The version operation should return the version of the extractor.
Step 2: Document the Extractor
The docstring for an extractor must start with a short, one-sentence summary of the extractor, which will be used by our autodocumentation tooling. The rest of the documentation should describe what types of files are compatible, what context information can be used, and summarize what types of metadata are generated.
Todo
Actually write these descriptors for the available extractors
The Scythe project uses JSON documents as the output for all extractors and
JSON Schema to describe the content of the documents. The
BaseExtractor class includes a property, schema, that stores a description of the output format.
We recommend writing your description as a separate file and having the schema property read
and output the contents of this file. See the GenericFileExtractor source code for an example.
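The file-backed schema pattern can be sketched as below. This is not the GenericFileExtractor implementation. The ExampleExtractor class, its schema_path argument, and the schema contents are assumptions made for illustration, and a temporary file stands in for a schema file that would normally ship alongside the extractor's module.

```python
import json
import os
import tempfile


class ExampleExtractor:
    """Hypothetical extractor whose schema lives in a separate JSON file."""

    def __init__(self, schema_path: str):
        self._schema_path = schema_path

    @property
    def schema(self) -> dict:
        # Read the description from disk every time the property is accessed
        with open(self._schema_path) as fp:
            return json.load(fp)


# Usage: write a small JSON Schema document, then read it back via the property
with tempfile.TemporaryDirectory() as tmp:
    schema_file = os.path.join(tmp, "example.json")
    with open(schema_file, "w") as fp:
        json.dump({"type": "object", "properties": {"size_bytes": {"type": "integer"}}}, fp)
    loaded_schema = ExampleExtractor(schema_file).schema
```

Keeping the schema in its own file lets it be edited and validated without touching the extractor code.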
Step 3: Register the Extractor
Preferred Route: Adding the Extractor to Scythe
If your extractor has the same dependencies as existing extractors, add it to the existing module with the same dependencies.
If your extractor has new dependencies, create a new module for your extractor in scythe, and
then add the requirements as a new key in the [tool.poetry.extras] section of pyproject.toml,
following the other extractor examples in that section. Next, add your extractor to
docs/source/extractors.rst by adding an .. automodule:: statement that refers to your new
module (again, following the existing pattern).
Scythe uses stevedore to simplify access to the extractors. After implementing and
documenting the extractor, add it to the [tool.poetry.plugins."scythe.extractor"] section of the
pyproject.toml file for Scythe. See the stevedore documentation for more information
(these docs reference setup.py, but the equivalent can be done via plugins in pyproject.toml;
follow the existing structure if you’re unsure, and ask for help from the developers if
you run into issues).
Alternative Route: Including Extractors from Other Libraries
If an extractor would be better suited as part of a different library, you can still register it as an
extractor with Scythe by altering your pyproject.toml file. Add an entry point with the
namespace "scythe.extractor" and point to the class object following the
stevedore documentation.
Adding the entry point will let Scythe use your extractor if your library is installed in the
same Python environment as Scythe.
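To check which extractors are registered in the current environment, you can list the entry points in the "scythe.extractor" namespace. The sketch below uses only the standard library's importlib.metadata and does not call stevedore itself. The helper name is hypothetical, and the branch handles the different return types of entry_points() across Python versions.

```python
from importlib import metadata
from typing import List


def list_extractor_entry_points(namespace: str = "scythe.extractor") -> List[str]:
    """Return the sorted names of entry points registered under a namespace."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Python 3.10 and later: entry points can be filtered by group
        selected = eps.select(group=namespace)
    else:
        # Python 3.8 and 3.9: entry_points() returns a dict keyed by group name
        selected = eps.get(namespace, [])
    return sorted(ep.name for ep in selected)


registered = list_extractor_entry_points()
```

If Scythe or a library that registers extractors is installed, registered should contain the names declared in that namespace; otherwise it is an empty list.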
Todo
Provide a public listing of scythe-compatible software.
So that people know where to find these external libraries