This doc describes the technical decisions behind the HPI configuration system.
It's more of a 'design doc' than a usage guide.
If you just want to know how to set up HPI or configure it, see [[file:SETUP.org][SETUP]].

I feel like it's good to keep the rationales in the documentation,
but happy to [[https://github.com/karlicoss/HPI/issues/46][discuss]] it here.

Before discussing the abstract matters, let's consider a specific situation.
Say, we want to let the user configure the [[https://github.com/karlicoss/HPI/blob/master/my/bluemaestro/__init__.py][bluemaestro]] module.
At the moment, it uses the following config attributes:

- ~export_path~

  Path to the data, this is obviously a *required* attribute

- ~cache_path~

  Cache is extremely useful to speed up some queries. But it's *optional*, everything should work without it.

I'll refer to this config as *specific* further in the doc, and give examples for each point. Note that they only illustrate the specific requirement, potentially ignoring the other ones.
Now, the requirements as I see them:

1. configuration should be *extremely* flexible

   We need to make sure it's very easy to combine/filter/extend data without having to turn the module code inside out.
   This means using a powerful language for config, and realistically, a Turing-complete one.

   General: that means that you should be able to use powerful syntax, potentially running arbitrary code if
   this is something you need (for whatever mad reason). It should be possible to override config attributes *at runtime*, if necessary, without rewriting files on the filesystem.

   Specific: we've got Python already, so it makes a lot of sense to use it!

   #+begin_src python
   class bluemaestro:
       export_path = '/path/to/bluemaestro/data'
       cache_path  = '/tmp/bluemaestro.cache'
   #+end_src

   Downsides:

   - keeping it overly flexible and powerful means it's potentially less accessible to people less familiar with programming

     But see the further point about keeping it simple. I claim that simple programs look as easy as simple JSON.

   - Python is 'less safe' than a plain JSON/YAML config

     But at the moment the whole thing is running potentially untrusted Python code anyway.
     It's not a tool you're going to install across your organization, run under root privileges, and let the employees tweak.

     Ultimately, you set it up for yourself, and the config has exactly the same permissions as the code you're installing.
     Thinking that a plain config would give you more security is deceptive: it's a false sense of security (at this stage of the project).

   # TODO I don't mind having json/toml/whatever, but only as an additional interface

   I also write more about all this [[https://beepb00p.xyz/configs-suck.html][here]].

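   To illustrate what a Turing-complete config buys you, here is a minimal sketch of a config that computes its values instead of hardcoding them. The environment variable names (~HPI_DATA_ROOT~, ~HPI_NO_CACHE~) are made up for this example, they are not something HPI reads:

   #+begin_src python
   import os
   from pathlib import Path

   class bluemaestro:
       # hypothetical variable: lets the same config work on several machines
       _root = Path(os.environ.get('HPI_DATA_ROOT', '/path/to'))
       export_path = _root / 'bluemaestro' / 'data'
       # hypothetical switch: disable the optional cache without editing the file
       cache_path = None if os.environ.get('HPI_NO_CACHE') else '/tmp/bluemaestro.cache'

   print(bluemaestro.export_path)
   #+end_src

   None of this is expressible in plain JSON without inventing extra conventions on top of it.
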
2. configuration should be *backwards compatible*

   General: the whole system is pretty chaotic, it's hard to control the versioning of different modules and their compatibility.
   It's important to allow changing attribute names and adding new functionality, while making sure the module works against an older version of the config.
   Ideally warn the user that they'd better migrate to a newer version if the fallbacks are triggered.
   Potentially: use individual versions for modules? Although it makes things a bit complicated.

   Specific: say the module is using a new config attribute, ~timezone~.
   We would need to adapt the module to support the old configs without timezone. For example, in ~bluemaestro.py~ (pseudo code):

   #+begin_src python
   import warnings

   # load_user_config and get_system_timezone are placeholders (pseudo code)
   user_config = load_user_config()
   if not hasattr(user_config, 'timezone'):
       warnings.warn("Please specify 'timezone' in the config! Falling back to the system timezone.")
       user_config.timezone = get_system_timezone()
   #+end_src

   This is possible to achieve with pretty much any config format, it's just important to keep in mind.

   Downsides: hopefully no one argues that backwards compatibility isn't important.

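   The same fallback idea extends to renamed attributes. A minimal sketch, assuming a hypothetical rename from ~path~ to ~export_path~; ~with_legacy_names~ is an illustrative helper and not part of HPI:

   #+begin_src python
   import warnings

   def with_legacy_names(config, renames):
       '''Copy values from old attribute names to the new ones, warning on each fallback.'''
       for old, new in renames.items():
           if not hasattr(config, new) and hasattr(config, old):
               warnings.warn(f"'{old}' is deprecated, please rename it to '{new}' in your config")
               setattr(config, new, getattr(config, old))
       return config

   class old_style_config:
       path = '/path/to/bluemaestro/data'  # hypothetical legacy attribute name

   config = with_legacy_names(old_style_config, {'path': 'export_path'})
   print(config.export_path)
   #+end_src

   The old config keeps working, and the warning nudges the user towards migrating.
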
3. configuration should be as *easy to write* as possible

   General: as lean and non-verbose as possible. No extra imports, no extra inheritance, annotations, etc. Loose coupling.

   Specific: the user *only* has to specify ~export_path~ to make the module function and that's it. For example:

   #+begin_src js
   {
       'export_path': '/path/to/bluemaestro/'
   }
   #+end_src

   It's possible to achieve with any configuration format (aided by some helpers to fill in optional attributes etc), so it's more of a guiding principle.

   Downsides:

   - no (mandatory) annotations means more potential to break, but I'd rather leave this decision to the users

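   As a rough idea of what such a helper could look like, here is a sketch that layers the user's values over the module's defaults. ~DEFAULTS~ and ~fill_defaults~ are names invented for this example:

   #+begin_src python
   from types import SimpleNamespace

   # hypothetical defaults for the optional attributes of the module
   DEFAULTS = {'cache_path': None}

   def fill_defaults(raw, defaults=DEFAULTS):
       '''Return a namespace with the user's values taking priority over the defaults.'''
       return SimpleNamespace(**{**defaults, **raw})

   # the user only specified the required attribute
   config = fill_defaults({'export_path': '/path/to/bluemaestro/'})
   print(config.export_path, config.cache_path)
   #+end_src
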
4. configuration should be as *easy to use and extend* as possible

   General: enable the users to add new config attributes and *immediately* use them without any hassle and boilerplate.
   It's easy to achieve on its own, but harder to achieve simultaneously with (2).

   Specific: if you keep the config as Python, simply importing the config in the module satisfies this property:

   #+begin_src python
   from my.config import bluemaestro as user_config
   #+end_src

   If the config is in JSON or something, it's possible to load it dynamically too without the boilerplate.

   Downsides: none, hopefully no one is against extensibility

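   For the JSON case, a minimal sketch of such dynamic loading with the standard library only. The JSON text and the ~my_new_attribute~ key are made up for illustration; in practice the text would come from a file:

   #+begin_src python
   import json
   from types import SimpleNamespace

   raw = '{"export_path": "/path/to/bluemaestro/", "my_new_attribute": 123}'

   # object_hook turns every JSON object into a namespace, so keys become attributes
   user_config = json.loads(raw, object_hook=lambda d: SimpleNamespace(**d))

   # a freshly added attribute is usable right away, no schema changes needed
   print(user_config.my_new_attribute)
   #+end_src
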
5. configuration should have checks

   General: make sure it's easy to track down configuration errors. At least runtime checks for required attributes, their types, warnings, that sort of thing. But a biggie for me is using *mypy* to statically typecheck the modules.
   To some extent it gets in the way of (2) and (4).

   Specific: ~NamedTuple~ or ~dataclass~ can verify the config with no extra boilerplate on the user side.

   #+begin_src python
   import json
   from typing import NamedTuple, Optional

   class bluemaestro(NamedTuple):
       export_path: str
       cache_path : Optional[str] = None

   # json.load expects an open file object rather than a path
   with open('configs/bluemaestro.json') as fo:
       raw_config = json.load(fo)
   config = bluemaestro(**raw_config)
   #+end_src

   This will fail if the required =export_path= is missing, and fill the optional =cache_path= with None. In addition, it's ~mypy~ friendly.

   Downsides: none, especially if it's possible to turn checks on/off.

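   Note that neither ~NamedTuple~ nor ~dataclass~ checks the /types/ at runtime. One possible way to add that, sketched here with a ~__post_init__~ hook, handles only plain classes like ~str~ and is not how HPI does it:

   #+begin_src python
   from dataclasses import dataclass, fields
   from typing import Optional

   @dataclass
   class bluemaestro:
       export_path: str
       cache_path: Optional[str] = None

       def __post_init__(self):
           for f in fields(self):
               value = getattr(self, f.name)
               # generic annotations such as Optional[str] are not plain classes and are skipped
               if isinstance(f.type, type) and not isinstance(value, f.type):
                   raise TypeError(f'{f.name}: expected {f.type.__name__}, got {type(value).__name__}')

   error = None
   try:
       bluemaestro(export_path=123)
   except TypeError as e:
       error = str(e)
   print(error)
   #+end_src

   Checking generic annotations properly would need a dedicated library, which is a tradeoff against keeping the config dependency-free.
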
6. configuration should be easy to document

   General: ideally, it should be autogenerated, be self-descriptive and have some sort of schema, to make sure the documentation (which no one likes to write) doesn't diverge.

   Specific: mypy annotations seem like the way to go. See the example from (5), it's pretty clear from the code what needs to be in the config.

   Downsides: none, self-documented code is good.

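   As a sketch of what 'autogenerated' could mean here, the annotations of a dataclass can be rendered into a short description. The ~describe~ helper and the docstring are invented for this example:

   #+begin_src python
   from dataclasses import MISSING, dataclass, fields
   from typing import Optional

   @dataclass
   class bluemaestro:
       '''Config for the bluemaestro module (illustrative docstring).'''
       export_path: str
       cache_path: Optional[str] = None

   def describe(cls):
       '''Render the docstring plus one line per field of a dataclass config.'''
       lines = [cls.__doc__.strip()]
       for f in fields(cls):
           # plain classes have a __name__; fall back to str() for anything else
           type_name = f.type.__name__ if isinstance(f.type, type) else str(f.type)
           required = f.default is MISSING and f.default_factory is MISSING
           lines.append(f"- {f.name}: {type_name} ({'required' if required else 'optional'})")
       return '\n'.join(lines)

   docs = describe(bluemaestro)
   print(docs)
   #+end_src

   Since the description is derived from the same class the module uses, it can't diverge from the code.
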
* Solution?

Now I'll consider potential solutions to the configuration, taking the different requirements into account.

Like I already mentioned, plain configs (JSON/YAML/TOML) are very inflexible and go against (1), which in my opinion makes them a no-go.

So: my suggestion is to write the *configs as Python code*.
It's hard to satisfy all requirements *at the same time*, but I want to argue that it's possible to satisfy most of them, depending on the maturity of the module which we're configuring.

Let's say you want to write a new module. You start with a plain class in the user config:

#+begin_src python
class bluemaestro:
    export_path = '/path/to/bluemaestro/data'
    cache_path  = '/tmp/bluemaestro.cache'
#+end_src

And to use it:

#+begin_src python
from my.config import bluemaestro as user_config
#+end_src

Let's go through the requirements:

- (1): *yes*, simply importing Python code is the most flexible you can get
  In addition, at runtime, you can simply assign a new config if you need some dynamic hacking:

  #+begin_src python
  import my.config

  class new_config:
      export_path = '/some/hacky/dynamic/path'
  # replace the attribute that 'from my.config import bluemaestro' resolves to
  my.config.bluemaestro = new_config
  #+end_src

  After that, =my.bluemaestro= would run against your new config.

- (2): *no*, but backwards compatibility is not necessary in the first version of the module
- (3): *mostly*, although optional fields require extra work
- (4): *yes*, whatever is in the config can immediately be used by the code
- (5): *mostly*, imports are transparent to ~mypy~, although runtime type checks would be nice too
- (6): *no*, you have to guess the config from the usage.

This approach is extremely simple, and already *good enough for initial prototyping* or *private modules*.

The main downside so far is the lack of documentation (6), which I'll try to solve next.
I see mypy annotations as the only sane way to support it, because we also get (5) for free. So we could use:

- potentially [[https://github.com/karlicoss/HPI/issues/12#issuecomment-610038961][file-config]]

  However, it's using plain files and doesn't satisfy (1).

  Also not sure about (5). =file-config= allows using mypy annotations, but I'm not convinced they would be correctly typed with mypy, I think you need a plugin for that.

- [[https://mypy.readthedocs.io/en/stable/protocols.html#simple-user-defined-protocols][Protocol]]

  I experimented with ~Protocol~ [[https://github.com/karlicoss/HPI/pull/45/commits/90b9d1d9c15abe3944913add5eaa5785cc3bffbc][here]].
  It's pretty cool, very flexible, and doesn't impose any runtime modifications, which makes it good for (4).

  The downsides are:

  - it doesn't support optional attributes (optional as in non-required, not as ~typing.Optional~), so it goes against (3)
  - prior to Python 3.8, it's a part of =typing_extensions= rather than standard =typing=, so using it requires guarding the code with =if typing.TYPE_CHECKING=, which is a bit confusing and bloating.

  TODO: check out [[https://mypy.readthedocs.io/en/stable/protocols.html#using-isinstance-with-protocols][@runtime_checkable]]?

- =NamedTuple=

  [[https://github.com/karlicoss/HPI/pull/45/commits/c877104b90c9d168eaec96e0e770e59048ce4465][Here]] I experimented with using ~NamedTuple~.

  Similarly to Protocol, it's self-descriptive, and in addition allows for non-required fields.
  # TODO something about helper methods? can't use them with Protocol

  Downsides:
  - it goes against (4), because NamedTuple (being a =tuple= at runtime) can only contain the attributes declared in the schema.

- =dataclass=

  Similar to =NamedTuple=, but it's possible to add extra attributes to a =dataclass= with ~setattr~ to implement (4).

  Downsides:
  - we partially lose (5), because dynamic attributes are not transparent to mypy.

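The runtime differences between these options can be seen in a small self-contained sketch (Python 3.8+). The class names (~HasExportPath~, ~AsTuple~, ~AsDataclass~) are made up for this comparison, and ~user_config~ is a stand-in for what would normally come from =my.config=:

#+begin_src python
from dataclasses import dataclass
from typing import NamedTuple, Optional, Protocol, runtime_checkable

class user_config:
    export_path = '/path/to/bluemaestro/data'

@runtime_checkable
class HasExportPath(Protocol):
    export_path: str

class AsTuple(NamedTuple):
    export_path: str
    cache_path: Optional[str] = None

@dataclass
class AsDataclass:
    export_path: str
    cache_path: Optional[str] = None

# Protocol: a structural check, the config object itself is left untouched.
# isinstance() only verifies that the attribute exists, not its type.
protocol_ok = isinstance(user_config(), HasExportPath)

# NamedTuple: defaults work, but attributes outside the schema are rejected
nt = AsTuple(export_path=user_config.export_path)
try:
    nt.extra = 1
    tuple_accepts_extra = True
except AttributeError:
    tuple_accepts_extra = False

# dataclass: defaults work, and extra attributes can be attached dynamically
dc = AsDataclass(export_path=user_config.export_path)
setattr(dc, 'extra', 1)

print(protocol_ok, tuple_accepts_extra, dc.extra)
#+end_src
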
My conclusion was using a *combined approach*:

- Use =@dataclass= base for documentation and default attributes, achieving (6) and (3)
- Inherit the original config class to bring in the extra attributes, achieving (4)

Inheritance is a standard mechanism, which doesn't require any extra frameworks and plays well with other Python concepts. As a specific example:

#+begin_src python
import dataclasses
from dataclasses import dataclass
from typing import Optional

from my.config import bluemaestro as user_config

@dataclass
class bluemaestro(user_config):
    '''
    The header of this file contributes towards the documentation
    '''
    export_path: str
    cache_path : Optional[str] = None

    @classmethod
    def make_config(cls) -> 'bluemaestro':
        # only pass the attributes declared in the schema to the constructor
        params = {
            k: v
            for k, v in vars(cls.__base__).items()
            if k in {f.name for f in dataclasses.fields(cls)}
        }
        return cls(**params)

config = bluemaestro.make_config()
#+end_src

I claim this solves pretty much everything:
- *(1)*: yes, the config attributes are preserved and can be anything that's allowed in Python
- *(2)*: collaterally, we also solved it, because we can handle renames and other legacy config adaptations in ~make_config~
- *(3)*: supports default attributes, at no extra cost
- *(4)*: the user config's attributes are available through the base class
- *(5)*: everything is mostly transparent to mypy. There are no runtime type checks yet, but I think it's possible to integrate them with ~@dataclass~
- *(6)*: the dataclass header is easily readable, and it's possible to generate the docs automatically

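To make claims (2), (3) and (4) concrete, here is a self-contained sketch of the same pattern with a stand-in ~user_config~ class instead of the =my.config= import. The legacy attribute name ~path~ and the ~my_extra~ attribute are assumptions made up for the example:

#+begin_src python
import dataclasses
from dataclasses import dataclass
from typing import Optional

class user_config:
    path = '/path/to/bluemaestro/data'  # hypothetical legacy name for export_path
    my_extra = 'anything'               # attribute the schema doesn't know about

@dataclass
class bluemaestro(user_config):
    export_path: str
    cache_path : Optional[str] = None

    @classmethod
    def make_config(cls) -> 'bluemaestro':
        names = {f.name for f in dataclasses.fields(cls)}
        params = {k: v for k, v in vars(cls.__base__).items() if k in names}
        # (2): accept the old attribute name if the new one is absent
        if 'export_path' not in params and hasattr(cls.__base__, 'path'):
            params['export_path'] = cls.__base__.path
        return cls(**params)

config = bluemaestro.make_config()
# (3): cache_path falls back to its default; (4): my_extra comes via the base class
print(config.export_path, config.cache_path, config.my_extra)
#+end_src
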
Downsides:
- inheriting from ~user_config~ means an early import of =my.config=

  Generally it's better to keep everything as lazy as possible and defer loading to the first time the config is used.
  This might be annoying at times, e.g. if you have a top-level import of your module, but no config.

  But considering that in 99% of cases the config is going to be on the disk
  and it's [[https://github.com/karlicoss/HPI/blob/1e6e0bd381d20437343473878c7f63b1f9d6362b/tests/demo.py#L22-L25][possible]] to do something dynamic like =del sys.modules['my.bluemaestro']= to reload the config, I think it's a minor issue.

- =make_config= allows for some mypy false negatives in the user config

  E.g. if you forgot the =export_path= attribute, mypy would miss it. But you'd have a runtime failure, and the downstream code using the config is still correctly type checked.

  Perhaps it will be better when [[https://github.com/python/mypy/issues/5374][this mypy issue]] is fixed.
- the =make_config= bit is a little scary and manual

  However, it's extracted in a generic helper, and [[https://github.com/karlicoss/HPI/blob/d6f071e3b12ba1cd5a86ad80e3821bec004e6a6d/my/twitter/archive.py#L17][ends up pretty simple]]

  # In addition, it's not even necessary if you don't have optional attributes, you can simply use the class variables (i.e. ~bluemaestro.export_path~)
  # upd. ugh, you can't, it doesn't handle default attributes overriding correctly (see tests/demo.py)
  # eh. basically all I need is class level dataclass??

- inheriting from ~user_config~ requires it to be a =class= rather than an =object=

  A practical downside is you can't use something like ~SimpleNamespace~.
  But considering you can define an ad-hoc =class= anywhere, this is fine?

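If you do end up with a ~SimpleNamespace~ (or any other object), one possible workaround is to build the ad-hoc class from it with the three-argument form of ~type~. This is just a sketch of the idea, not something HPI provides:

#+begin_src python
from types import SimpleNamespace

ns = SimpleNamespace(export_path='/path/to/bluemaestro/data')

# type(name, bases, namespace) creates a class whose class attributes are the namespace's fields
user_config = type('user_config', (), dict(vars(ns)))

# the result is a regular class, so it can be inherited from
class bluemaestro(user_config):
    pass

print(bluemaestro.export_path)
#+end_src
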
My conclusion is that I'm going with this approach for now.
Note that at no stage did it require any changes to the user configs, so if I missed something, it would be reversible.

* Side modules :noexport:

Some of TODO rexport?

To some extent, this is an experiment. I'm not sure how much value is in .


One thing are TODO software? libraries that have fairly well defined APIs and you can reasonably version them.

Another thing is the modules for accessing data, where you'd hopefully have everything backwards compatible.
Maybe in the future

I'm just not sure, happy to hear people's opinions on this.