HPI/doc/CONFIGURING.org
2020-05-10 12:05:36 +01:00

I feel like it's good to keep the rationales in the documentation, but happy to discuss it here.

Before discussing the abstract matters, let's consider a specific situation. Say we want to let the user configure the bluemaestro module. At the moment, it uses the following config attributes:

  • export_path: path to the data; this is obviously a required attribute
  • cache_path: a cache is extremely useful for speeding up some queries, but it's optional; everything should work without it.

I'll refer to this config as 'specific' further in the doc, and give an example for each point. Note that the examples only illustrate the specific requirement, potentially ignoring the other ones. Now, the requirements as I see them:

  1. configuration should be extremely flexible

    We need to make sure it's very easy to combine/filter/extend data without having to modify and rewrite the module code. This means using a powerful language for the config, and realistically, a Turing complete one.

    General: that means you should be able to use powerful syntax, potentially running arbitrary code if this is something you need.

    Specific: we've got Python already, so it makes a lot of sense to use it!

    class bluemaestro:
        export_path = '/path/to/bluemaestro'
        cache_path  = '/tmp/bluemaestro.cache'

    Downsides:

    • keeping it Turing complete means it's potentially less accessible to people less familiar with programming. But see the further point about keeping it simple. I claim that simple programs look as easy as simple json.
    • Python is 'less safe' than a plain json/yaml config. But at the moment the whole thing is running potentially untrusted Python code anyway. It's not a tool you're going to install across your organization, run under root privileges, and let the employees tweak. Ultimately, you set it up for yourself, and the config has exactly the same permissions as the code you're installing. Thinking that a plain config would give you more security is deceptive: it's a false sense of security (at this stage of the project).

    I also write more about all this here.
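To make the flexibility point concrete, here is a minimal sketch of a config that computes its values instead of hardcoding them. The environment variables MY_DATA_ROOT and MY_USE_CACHE are hypothetical names made up for this illustration; nothing in the module depends on them.

```python
import os
from pathlib import Path

# Hypothetical: the base directory comes from an environment variable, with a fallback.
DATA_ROOT = Path(os.environ.get('MY_DATA_ROOT', '/path/to'))

class bluemaestro:
    # computed from DATA_ROOT rather than hardcoded
    export_path = DATA_ROOT / 'bluemaestro'
    # the cache stays optional: only enabled when the (hypothetical) variable is set
    cache_path = '/tmp/bluemaestro.cache' if os.environ.get('MY_USE_CACHE') else None
```

A static JSON/YAML file could not express the conditional or the path arithmetic without extra machinery on the module side.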

  2. configuration should be backwards compatible

    General: the whole system is pretty chaotic, it's hard to control the versioning of different modules and their compatibility. It's important to allow changing attribute names and adding new functionality, while making sure the module works against an older version of the config. Ideally warn the user that they'd better migrate to a newer version if the fallbacks are triggered. Potentially: use individual versions for modules? Although it makes things a bit complicated.

    Specific: say the module is using a new config attribute, timezone. We would need to adapt the module to support the old configs without timezone. For example, in bluemaestro.py (pseudocode):

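    import warnings

    user_config = load_user_config()  # pseudocode: however the module obtains the config
    if not hasattr(user_config, 'timezone'):
        # old config without 'timezone': warn the user and fall back
        warnings.warn("Please specify 'timezone' in the config! Falling back to the system timezone.")
        user_config.timezone = get_system_timezone()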

    This is possible to achieve with pretty much any config format, just important to keep in mind.
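The same fallback idea can be sketched as a small self-contained helper. The name with_fallbacks is hypothetical, and a fixed 'UTC' value stands in for the system timezone lookup so that the example runs on its own.

```python
import warnings
from types import SimpleNamespace

def with_fallbacks(user_config, fallbacks):
    """Fill in attributes missing from an older config, warning about each one."""
    for name, default in fallbacks.items():
        if not hasattr(user_config, name):
            warnings.warn(f"Please specify {name!r} in the config! Falling back to {default!r}.")
            setattr(user_config, name, default)
    return user_config

# an 'old' config that predates the timezone attribute
old_config = SimpleNamespace(export_path='/path/to/bluemaestro')
config = with_fallbacks(old_config, {'timezone': 'UTC'})  # 'UTC' is a placeholder default
```

Configs that already define the attribute pass through untouched, so newer configs never see the warning.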

  3. configuration should be as easy to write as possible

    General: as lean and non-verbose as possible. No extra imports, no extra inheritance, annotations, etc.

    Specific: the user only has to specify export_path to make the module function and that's it. For example:

    {
         'export_path': '/path/to/bluemaestro/'
    }

    It's possible to achieve with any configuration format (aided by some helpers to fill in optional attributes etc), so it's more of a guiding principle.
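One possible shape for such a helper, as a sketch only: make_config and the OPTIONAL table are hypothetical names, not part of the module.

```python
from types import SimpleNamespace

# Hypothetical table of optional attributes and their defaults.
OPTIONAL = {'cache_path': None}

def make_config(raw):
    """Build a config object from a minimal user dict, filling in optional attributes."""
    if 'export_path' not in raw:
        raise KeyError("'export_path' is required")
    # user-supplied values override the defaults
    return SimpleNamespace(**{**OPTIONAL, **raw})

config = make_config({'export_path': '/path/to/bluemaestro/'})
```

The user writes only the one required key; the module still sees every attribute it expects.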

  4. configuration should be as easy to use and extend as possible

    General: enable the users to add new config attributes and immediately use them without any hassle and boilerplate. It's easy to achieve on its own, but harder to achieve simultaneously with (2).

    Specific: if you keep the config as Python, simply importing the config in the module satisfies this property:

    from my.config import bluemaestro as user_config

    If the config is in JSON or something, it's possible to load it dynamically too without the boilerplate.
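For the JSON case, a rough sketch of such dynamic loading using only the standard library. The inline string stands in for the contents of a config file, and sensor_name is a made-up attribute that the module author never declared.

```python
import json
from types import SimpleNamespace

# Stand-in for the contents of a config file, including a new
# 'sensor_name' attribute (hypothetical) added by the user.
raw = '{"export_path": "/path/to/bluemaestro/", "sensor_name": "bedroom"}'

# object_hook turns every JSON object into a namespace with attribute access
user_config = json.loads(raw, object_hook=lambda d: SimpleNamespace(**d))
```

After this, user_config.sensor_name is usable straight away, with no schema change or extra boilerplate in the module.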

  5. configuration should have checks

    General: make sure it's easy to track down configuration errors. At least runtime checks for required attributes, their types, warnings, that sort of thing. But a biggie for me is using mypy to statically typecheck the modules. To some extent it gets in the way of (2) and (4).

    Specific: a NamedTuple/dataclass can verify the config with no extra boilerplate on the user side.

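    import json
    from typing import NamedTuple, Optional

    class bluemaestro(NamedTuple):
        export_path: str
        cache_path : Optional[str] = None

    # json.load expects an open file object, not a path
    with open('configs/bluemaestro.json') as f:
        raw_config = json.load(f)
    config = bluemaestro(**raw_config)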

    This will fail if required export_path is missing, and fill optional cache_path with None. In addition, it's mypy friendly.
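Note that NamedTuple annotations are not enforced at runtime; they only help static checkers such as mypy. If runtime type checks are wanted as well, one option is a dataclass with a __post_init__ hook. This is a sketch of that alternative, not what the module currently does.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class bluemaestro:
    export_path: str
    cache_path: Optional[str] = None

    def __post_init__(self) -> None:
        # runtime checks that the annotations alone do not perform
        if not isinstance(self.export_path, str):
            raise TypeError(f"export_path must be a str, got {type(self.export_path).__name__}")
        if self.cache_path is not None and not isinstance(self.cache_path, str):
            raise TypeError(f"cache_path must be a str or None, got {type(self.cache_path).__name__}")

config = bluemaestro(export_path='/path/to/bluemaestro')
```

A missing export_path still fails at construction time, and a wrongly typed value now fails there too instead of deep inside the module.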

  6. configuration should be easy to document

    General: ideally, it should be autogenerated, be self-descriptive and have some sort of schema, to make sure the documentation (which no one likes to write) doesn't diverge.

    Specific: mypy annotations seem like the way to go. I did some experiments with using Protocol or a NamedTuple for a self-descriptive my.reddit configuration. See the example from (5), it's pretty clear from the code what needs to be in the config.
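As a rough illustration of autogenerating documentation from annotations: the describe helper below is hypothetical and simply walks the type hints and defaults of a NamedTuple config class like the bluemaestro one.

```python
from typing import NamedTuple, Optional, get_type_hints

class bluemaestro(NamedTuple):
    export_path: str
    cache_path: Optional[str] = None

def describe(config_cls) -> str:
    """Render one line per attribute: name, type, and whether it is required."""
    defaults = config_cls._field_defaults  # only the fields that have a default
    lines = []
    for name, tp in get_type_hints(config_cls).items():
        type_name = tp.__name__ if isinstance(tp, type) else str(tp)
        status = f'optional, default {defaults[name]!r}' if name in defaults else 'required'
        lines.append(f'{name}: {type_name} ({status})')
    return '\n'.join(lines)

doc = describe(bluemaestro)
```

Since the text is derived from the same class the module uses, the generated description cannot drift away from the actual config schema.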