Merge pull request #45 from karlicoss/better-configs
Better configs: safer and self documented
commit
d6f071e3b1
13 changed files with 560 additions and 38 deletions
266
doc/CONFIGURING.org
Normal file
|
@ -0,0 +1,266 @@
|
|||
I feel like it's good to keep the rationales in the documentation,
|
||||
but happy to [[https://github.com/karlicoss/HPI/issues/46][discuss]] it here.
|
||||
|
||||
Before discussing the abstract matters, let's consider a specific situation.
|
||||
Say, we want to let the user configure [[https://github.com/karlicoss/HPI/blob/master/my/bluemaestro/__init__.py][bluemaestro]] module.
|
||||
At the moment, it uses the following config attributes:
|
||||
|
||||
- ~export_path~
|
||||
|
||||
Path to the data, this is obviously a *required* attribute
|
||||
|
||||
- ~cache_path~
|
||||
|
||||
Cache is extremely useful to speed up some queries. But it's *optional*, everything should work without it.
|
||||
|
||||
|
||||
|
||||
I'll refer to this config as *specific* further in the doc, and give examples for each point. Note that they only illustrate the specific requirement, potentially ignoring the other ones.
|
||||
Now, the requirements as I see it:
|
||||
|
||||
1. configuration should be *extremely* flexible
|
||||
|
||||
We need to make sure it's very easy to combine/filter/extend data without having to modify and rewrite the module code.
|
||||
This means using a powerful language for config, and realistically, a Turing-complete one.
|
||||
|
||||
General: that means that you should be able to use powerful syntax, potentially running arbitrary code if
|
||||
this is something you need (for whatever mad reason). It should be possible to override config attributes in runtime, if necessary.
|
||||
|
||||
Specific: we've got Python already, so it makes a lot of sense to use it!
|
||||
|
||||
#+begin_src python
class bluemaestro:
    export_path = '/path/to/bluemaestro/data'
    cache_path  = '/tmp/bluemaestro.cache'
#+end_src
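A minimal sketch of what this flexibility buys you. The =DATA_ROOT= name and the directory layout are assumptions made up for illustration, not something HPI prescribes: the config can compute its values, and attributes can be overridden at runtime.

```python
from pathlib import PurePosixPath

# hypothetical base directory, purely for illustration
DATA_ROOT = PurePosixPath('/path/to')

class bluemaestro:
    # values can be computed with ordinary Python instead of being hardcoded
    export_path = str(DATA_ROOT / 'bluemaestro' / 'data')
    cache_path  = '/tmp/bluemaestro.cache'

# attributes can also be overridden at runtime, e.g. to disable the cache in a test
bluemaestro.cache_path = None
```

None of this needs support from the config format itself: it's just Python.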
|
||||
|
||||
Downsides:
|
||||
|
||||
- keeping it overly flexible and powerful means it's potentially less accessible to people less familiar with programming
|
||||
|
||||
But see the further point about keeping it simple. I claim that simple programs look as easy as simple json.
|
||||
|
||||
- Python is 'less safe' than a plain json/yaml config
|
||||
|
||||
But at the moment the whole thing is running potentially untrusted Python code anyway.
|
||||
It's not a tool you're going to install across your organization, run under root privileges, and let the employees tweak.
|
||||
|
||||
Ultimately, you set it up for yourself, and the config has exactly the same permissions as the code you're installing.
|
||||
Thinking that a plain config would give you more security is deceptive: it's a false sense of security (at this stage of the project).
|
||||
|
||||
# TODO I don't mind having json/toml/whatever, but only as an additional interface
|
||||
|
||||
I also write more about all this [[https://beepb00p.xyz/configs-suck.html][here]].
|
||||
|
||||
2. configuration should be *backwards compatible*
|
||||
|
||||
General: the whole system is pretty chaotic, it's hard to control the versioning of different modules and their compatibility.
|
||||
It's important to allow changing attribute names and adding new functionality, while making sure the module works against an older version of the config.
|
||||
Ideally warn the user that they'd better migrate to a newer version if the fallbacks are triggered.
|
||||
Potentially: use individual versions for modules? Although it makes things a bit complicated.
|
||||
|
||||
Specific: say the module is using a new config attribute, ~timezone~.
|
||||
We would need to adapt the module to support the old configs without timezone. For example, in ~bluemaestro.py~ (pseudo code):
|
||||
|
||||
#+begin_src python
user_config = load_user_config()
if not hasattr(user_config, 'timezone'):
    warnings.warn("Please specify 'timezone' in the config! Falling back to the system timezone.")
    user_config.timezone = get_system_timezone()
#+end_src
|
||||
|
||||
This is possible to achieve with pretty much any config format, just important to keep in mind.
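For reference, a runnable variant of the fallback. It's a sketch under assumptions: =get_system_timezone= and =with_timezone_fallback= are hypothetical helpers written for this example, and a =SimpleNamespace= stands in for an old user config.

```python
import warnings
from datetime import datetime, tzinfo
from types import SimpleNamespace

def get_system_timezone() -> tzinfo:
    # hypothetical helper: the local timezone as reported by the OS
    tz = datetime.now().astimezone().tzinfo
    assert tz is not None
    return tz

def with_timezone_fallback(user_config):
    # old configs have no 'timezone': warn the user and fall back
    if not hasattr(user_config, 'timezone'):
        warnings.warn("Please specify 'timezone' in the config! Falling back to the system timezone.")
        user_config.timezone = get_system_timezone()
    return user_config

# stand-in for an old-style user config without 'timezone'
old_config = SimpleNamespace(export_path='/path/to/bluemaestro/data')
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    config = with_timezone_fallback(old_config)
```

The old config keeps working, and the warning nudges the user towards migrating.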
|
||||
|
||||
Downsides: hopefully no one argues that backwards compatibility is unimportant.
|
||||
|
||||
3. configuration should be as *easy to write* as possible
|
||||
|
||||
General: as lean and non-verbose as possible. No extra imports, no extra inheritance, annotations, etc. Loose coupling.
|
||||
|
||||
Specific: the user *only* has to specify ~export_path~ to make the module function and that's it. For example:
|
||||
|
||||
#+begin_src js
{
    'export_path': '/path/to/bluemaestro/'
}
#+end_src
|
||||
|
||||
It's possible to achieve with any configuration format (aided by some helpers to fill in optional attributes etc), so it's more of a guiding principle.
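One way such a helper could look, as a sketch only: =fill_defaults=, =DEFAULTS= and =REQUIRED= are names invented for this example, not existing HPI code.

```python
from typing import Any, Dict

# hypothetical schema for the bluemaestro module
DEFAULTS: Dict[str, Any] = {'cache_path': None}
REQUIRED = ('export_path',)

def fill_defaults(raw: Dict[str, Any]) -> Dict[str, Any]:
    missing = [k for k in REQUIRED if k not in raw]
    if missing:
        raise KeyError(f'missing required attributes: {missing}')
    # user-supplied values win over the defaults
    return {**DEFAULTS, **raw}

# the user only had to specify export_path
config = fill_defaults({'export_path': '/path/to/bluemaestro/'})
```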
|
||||
|
||||
Downsides:
|
||||
|
||||
- no (mandatory) annotations means more potential to break, but I'd rather leave this decision to the users
|
||||
|
||||
4. configuration should be as *easy to use and extend* as possible
|
||||
|
||||
General: enable the users to add new config attributes and *immediately* use them without any hassle and boilerplate.
|
||||
It's easy to achieve on its own, but harder to achieve simultaneously with (2).
|
||||
|
||||
Specific: if you keep the config as Python, simply importing the config in the module satisfies this property:
|
||||
|
||||
#+begin_src python
|
||||
from my.config import bluemaestro as user_config
|
||||
#+end_src
|
||||
|
||||
If the config is in JSON or something, it's possible to load it dynamically too without the boilerplate.
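A sketch of such dynamic loading, assuming the JSON has already been read into a string (=my_new_attribute= is a made-up attribute): the standard =object_hook= parameter of =json.loads= turns every JSON object into a namespace, so any new attribute is immediately available via dot access.

```python
import json
from types import SimpleNamespace

# stand-in for the contents of a JSON config file
raw = '{"export_path": "/path/to/bluemaestro/data", "my_new_attribute": 123}'

# object_hook is called for every decoded JSON object
user_config = json.loads(raw, object_hook=lambda d: SimpleNamespace(**d))
```

After this, =user_config.my_new_attribute= works without declaring the attribute anywhere.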
|
||||
|
||||
Downsides: none, hopefully no one is against extensibility
|
||||
|
||||
5. configuration should have checks
|
||||
|
||||
General: make sure it's easy to track down configuration errors. At least runtime checks for required attributes, their types, warnings, that sort of thing. But a biggie for me is using *mypy* to statically typecheck the modules.
|
||||
To some extent it gets in the way of (2) and (4).
|
||||
|
||||
Specific: using ~NamedTuple~ or ~dataclass~ makes it possible to verify the config with no extra boilerplate on the user side.
|
||||
|
||||
#+begin_src python
import json
from typing import NamedTuple, Optional

class bluemaestro(NamedTuple):
    export_path: str
    cache_path : Optional[str] = None

# json.load expects an open file object, not a path
with open('configs/bluemaestro.json') as fo:
    raw_config = json.load(fo)
config = bluemaestro(**raw_config)
#+end_src
|
||||
|
||||
This will fail if required =export_path= is missing, and fill optional =cache_path= with None. In addition, it's ~mypy~ friendly.
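To make the failure mode concrete, here is a small self-contained sketch using in-memory dicts instead of a JSON file. Note that this only checks the *presence* of attributes: ~NamedTuple~ doesn't validate types at runtime, that part is up to mypy.

```python
from typing import NamedTuple, Optional

class bluemaestro(NamedTuple):
    export_path: str
    cache_path : Optional[str] = None

# required attribute present: the optional one gets its default
ok = bluemaestro(**{'export_path': '/path/to/bluemaestro/data'})

# required attribute missing: the constructor raises TypeError
try:
    bluemaestro(**{'cache_path': '/tmp/bluemaestro.cache'})
except TypeError as e:
    error = str(e)
else:
    error = None
```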
|
||||
|
||||
Downsides: none, especially if it's possible to turn checks on/off.
|
||||
|
||||
6. configuration should be easy to document
|
||||
|
||||
General: ideally, it should be autogenerated, be self-descriptive and have some sort of schema, to make sure the documentation (which no one likes to write) doesn't diverge.
|
||||
|
||||
Specific: mypy annotations seem like the way to go. See the example from (5), it's pretty clear from the code what needs to be in the config.
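As a rough sketch of how documentation could be derived from the annotations (the =describe= function is hypothetical, written for this example): =typing.get_type_hints= gives the declared types, and =_field_defaults= tells which fields are optional.

```python
from typing import NamedTuple, Optional, get_type_hints

class bluemaestro(NamedTuple):
    export_path: str
    cache_path : Optional[str] = None

def describe(cls) -> str:
    # fields with a default are optional, the rest are required
    defaults = cls._field_defaults
    lines = [f'class {cls.__name__}:']
    for name, tp in get_type_hints(cls).items():
        kind = 'optional' if name in defaults else 'required'
        # plain classes have __name__, typing constructs may not
        tname = getattr(tp, '__name__', str(tp))
        lines.append(f'    {name}: {tname}  # {kind}')
    return '\n'.join(lines)

doc = describe(bluemaestro)
```

Since the text is generated from the schema itself, it can't diverge from what the module actually expects.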
|
||||
|
||||
Downsides: none, self-documented code is good.
|
||||
|
||||
* Solution?
|
||||
|
||||
Now I'll consider potential solutions to the configuration, taking the different requirements into account.
|
||||
|
||||
Like I already mentioned, plain configs (JSON/YAML/TOML) are very inflexible and go against (1), which in my opinion makes them a no-go.
|
||||
|
||||
So: my suggestion is to write the *configs as Python code*.
|
||||
It's hard to satisfy all requirements *at the same time*, but I want to argue, it's possible to satisfy most of them, depending on the maturity of the module which we're configuring.
|
||||
|
||||
Let's say you want to write a new module. You start with a plain class:
|
||||
|
||||
#+begin_src python
class bluemaestro:
    export_path = '/path/to/bluemaestro/data'
    cache_path  = '/tmp/bluemaestro.cache'
#+end_src
|
||||
|
||||
And to use it:
|
||||
|
||||
#+begin_src python
|
||||
from my.config import bluemaestro as user_config
|
||||
#+end_src
|
||||
|
||||
Let's go through requirements:
|
||||
|
||||
- (1): *yes*, simply importing Python code is the most flexible you can get
|
||||
- (2): *no*, but backwards compatibility is not necessary in the first version of the module
|
||||
- (3): *mostly*, although optional fields require extra work
|
||||
- (4): *yes*, whatever is in the config can immediately be used by the code
|
||||
- (5): *mostly*, imports are transparent to ~mypy~, although runtime type checks would be nice too
|
||||
- (6): *no*, you have to guess the config from the usage.
|
||||
|
||||
This approach is extremely simple, and already *good enough for initial prototyping* or *private modules*.
|
||||
|
||||
The main downside so far is the lack of documentation (6), which I'll try to solve next.
|
||||
I see mypy annotations as the only sane way to support it, because we also get (5) for free. So we could use:
|
||||
|
||||
- potentially [[https://github.com/karlicoss/HPI/issues/12#issuecomment-610038961][file-config]]
|
||||
|
||||
However, it's using plain files and doesn't satisfy (1).
|
||||
|
||||
Also not sure about (5). =file-config= allows using mypy annotations, but I'm not convinced they would be correctly typed with mypy, I think you need a plugin for that.
|
||||
|
||||
- [[https://mypy.readthedocs.io/en/stable/protocols.html#simple-user-defined-protocols][Protocol]]
|
||||
|
||||
I experimented with ~Protocol~ [[https://github.com/karlicoss/HPI/pull/45/commits/90b9d1d9c15abe3944913add5eaa5785cc3bffbc][here]].
|
||||
It's pretty cool, very flexible, and doesn't impose any runtime modifications, which makes it good for (4).
|
||||
|
||||
The downsides are:
|
||||
|
||||
- it doesn't support optional attributes (optional as in non-required, not as ~typing.Optional~), so it goes against (3)
|
||||
- prior to python 3.8, it's a part of =typing_extensions= rather than standard =typing=, so using it requires guarding the code with =if typing.TYPE_CHECKING=, which is a bit confusing and bloating.
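A sketch of the first downside, assuming Python 3.8+ where ~Protocol~ lives in =typing=. The class names are made up for the example. =runtime_checkable= is used only to make the problem visible at runtime (it checks attribute presence, not types); mypy would reject =minimal_config= statically for the same reason.

```python
from typing import Optional, Protocol, runtime_checkable

@runtime_checkable
class BluemaestroConfig(Protocol):
    export_path: str
    cache_path : Optional[str]  # Optional type, but the attribute itself is still required

class full_config:
    export_path = '/path/to/bluemaestro/data'
    cache_path  = '/tmp/bluemaestro.cache'

class minimal_config:
    # only the attribute the user really cares about
    export_path = '/path/to/bluemaestro/data'

full_ok    = isinstance(full_config()   , BluemaestroConfig)
minimal_ok = isinstance(minimal_config(), BluemaestroConfig)
```

Here =full_ok= is true and =minimal_ok= is false: there is no way to declare =cache_path= as something the user may simply leave out.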
|
||||
|
||||
- =NamedTuple=
|
||||
|
||||
[[https://github.com/karlicoss/HPI/pull/45/commits/c877104b90c9d168eaec96e0e770e59048ce4465][Here]] I experimented with using ~NamedTuple~.
|
||||
|
||||
Similarly to Protocol, it's self-descriptive, and in addition allows for non-required fields.
|
||||
# TODO something about helper methods? can't use them with Protocol
|
||||
|
||||
Downsides:
|
||||
- it goes against (4), because NamedTuple (being a =tuple= in runtime) can only contain the attributes declared in the schema.
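To illustrate with a minimal sketch (=my_new_attribute= is a made-up attribute): both passing an undeclared attribute to the constructor and attaching one afterwards fail.

```python
from typing import NamedTuple, Optional

class bluemaestro(NamedTuple):
    export_path: str
    cache_path : Optional[str] = None

config = bluemaestro(export_path='/path/to/bluemaestro/data')

errors = []
try:
    # unknown keyword arguments are rejected by the constructor
    bluemaestro(export_path='/path/to/bluemaestro/data', my_new_attribute=123)
except TypeError as e:
    errors.append(type(e).__name__)
try:
    # and instances have no __dict__ to attach extra attributes to
    config.my_new_attribute = 123
except AttributeError as e:
    errors.append(type(e).__name__)
```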
|
||||
|
||||
- =dataclass=
|
||||
|
||||
Similar to =NamedTuple=, but it's possible to add extra attributes to a =dataclass= with ~setattr~ to implement (4).
|
||||
|
||||
Downsides:
|
||||
- we partially lose (5), because dynamic attributes are not transparent to mypy.
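A minimal sketch of that trade-off (again, =my_new_attribute= is made up): the extra attribute is usable at runtime, but it isn't part of the declared schema, so mypy knows nothing about it.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class bluemaestro:
    export_path: str
    cache_path : Optional[str] = None

config = bluemaestro(export_path='/path/to/bluemaestro/data')

# extra attribute that is not declared in the schema
setattr(config, 'my_new_attribute', 123)

# only the declared fields are visible to dataclasses (and to mypy)
declared = [f.name for f in fields(config)]
```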
|
||||
|
||||
|
||||
My conclusion was using a *combined approach*:
|
||||
|
||||
- Use =@dataclass= base for documentation and default attributes, achieving (6) and (3)
|
||||
- Inherit the original config class to bring in the extra attributes, achieving (4)
|
||||
|
||||
Inheritance is a standard mechanism, which doesn't require any extra frameworks and plays well with other Python concepts. As a specific example:
|
||||
|
||||
#+begin_src python
import dataclasses
from dataclasses import dataclass
from typing import Optional

from my.config import bluemaestro as user_config

@dataclass
class bluemaestro(user_config):
    '''
    The header of this file contributes towards the documentation
    '''
    export_path: str
    cache_path : Optional[str] = None

    @classmethod
    def make_config(cls) -> 'bluemaestro':
        # pick the declared fields from the user's config (the base class)
        params = {
            k: v
            for k, v in vars(cls.__base__).items()
            if k in {f.name for f in dataclasses.fields(cls)}
        }
        return cls(**params)

config = bluemaestro.make_config()
#+end_src
|
||||
|
||||
I claim this solves pretty much everything:
|
||||
- *(1)*: yes, the config attributes are preserved and can be anything that's allowed in Python
|
||||
- *(2)*: collaterally, we also solved it, because we can adapt for renames and other legacy config adaptations in ~make_config~
|
||||
- *(3)*: supports default attributes, at no extra cost
|
||||
- *(4)*: the user config's attributes are available through the base class
|
||||
- *(5)*: everything is transparent to mypy. However, it still lacks runtime checks.
|
||||
- *(6)*: the dataclass header is easily readable, and it's possible to generate the docs automatically
|
||||
|
||||
Downsides:
|
||||
- the =make_config= bit is a little scary and manual, however, it can be extracted in a generic helper method
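A self-contained sketch of such a generic helper, modelled on the =make_config= function this commit adds in =my/core/cfg.py=. The =user_config= class below is a stand-in for =from my.config import bluemaestro as user_config=, and both the legacy name =export_dir= and =my_new_attribute= are assumptions made up for the example.

```python
from dataclasses import dataclass, fields
from typing import Any, Callable, Dict, Optional, Type, TypeVar

Attrs = Dict[str, Any]
C = TypeVar('C')

def make_config(cls: Type[C], migration: Callable[[Attrs], Attrs] = lambda x: x) -> C:
    # attributes defined on the user's class, i.e. the base of the dataclass
    props = migration(dict(vars(cls.__base__)))
    names = {f.name for f in fields(cls)}
    # only the declared fields go through the constructor
    return cls(**{k: v for k, v in props.items() if k in names})

# stand-in for the user's config
class user_config:
    export_dir = '/path/to/bluemaestro/data'  # hypothetical legacy name
    my_new_attribute = 123                    # not declared in the schema

@dataclass
class bluemaestro(user_config):
    export_path: str
    cache_path : Optional[str] = None

def migration(attrs: Attrs) -> Attrs:
    if 'export_dir' in attrs:  # adapt the legacy name
        attrs['export_path'] = attrs['export_dir']
    return attrs

config = make_config(bluemaestro, migration=migration)
```

The legacy name is migrated (2), the default is filled in (3), and the undeclared attribute is still reachable through the base class (4).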
|
||||
|
||||
My conclusion is that I'm going with this approach for now.
|
||||
Note that at no stage did this require any changes to the user configs, so if I missed something, it would be reversible.
|
||||
|
||||
* Side modules :noexport:
|
||||
|
||||
Some of TODO rexport?
|
||||
|
||||
To some extent, this is an experiment. I'm not sure how much value is in .
|
||||
|
||||
|
||||
One thing are TODO software? libraries that have fairly well defined APIs and you can reasonably version them.
|
||||
|
||||
Another thing is the modules for accessing data, where you'd hopefully have everything backwards compatible.
|
||||
Maybe in the future
|
||||
|
||||
I'm just not sure, happy to hear people's opinions on this.
|
||||
|
||||
|
108
doc/MODULES.org
Normal file
|
@ -0,0 +1,108 @@
|
|||
This file is an overview of *documented* modules. There are many more, see [[file:../README.org::#whats-inside]["What's inside"]] for the full list of modules.
|
||||
|
||||
See [[file:SETUP.org][SETUP]] to find out how to set up your own config.
|
||||
|
||||
Some explanations:
|
||||
|
||||
- [[https://docs.python.org/3/library/pathlib.html#pathlib.Path][Path]] is a standard Python object to represent paths
|
||||
- [[https://github.com/karlicoss/HPI/blob/5f4acfddeeeba18237e8b039c8f62bcaa62a4ac2/my/core/common.py#L9][PathIsh]] is a helper type to allow using either =str=, or a =Path=
|
||||
- [[https://github.com/karlicoss/HPI/blob/5f4acfddeeeba18237e8b039c8f62bcaa62a4ac2/my/core/common.py#L108][Paths]] is another helper type for paths.
|
||||
|
||||
It's 'smart', allows you to be flexible about your config:
|
||||
|
||||
- simple =str= or a =Path=
|
||||
- =/a/path/to/directory/=, so the module will consume all files from this directory
|
||||
- a list of files/directories (it will be flattened)
|
||||
- a [[https://docs.python.org/3/library/glob.html?highlight=glob#glob.glob][glob]] string, so you can be flexible about the format of your data on disk (e.g. if you want to keep it compressed)
|
||||
|
||||
Typically, such a variable will be passed to =get_files= to actually extract the list of real files to use. You can see usage examples [[https://github.com/karlicoss/HPI/blob/master/tests/get_files.py][here]].
|
||||
|
||||
- if the field has a default value, you can omit it from your private config.
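To give a feel for how a =Paths= value might be resolved into concrete files, here is a simplified sketch. It's *not* the actual =get_files= implementation: =expand_paths= is a hypothetical helper, and the =PathIsh= / =Paths= aliases are assumed definitions that merely follow the description above.

```python
import tempfile
from glob import glob
from pathlib import Path
from typing import List, Sequence, Union

# assumed definitions, following the description above
PathIsh = Union[Path, str]
Paths   = Union[Sequence[PathIsh], PathIsh]

def expand_paths(pp: Paths) -> List[Path]:
    # hypothetical helper; the real get_files may behave differently
    items = [pp] if isinstance(pp, (str, Path)) else list(pp)
    res: List[Path] = []
    for item in items:
        p = Path(item)
        if p.is_dir():
            # a directory: consume all files directly inside it
            res.extend(sorted(x for x in p.iterdir() if x.is_file()))
        elif any(c in str(item) for c in '*?['):
            # a glob string
            res.extend(Path(g) for g in sorted(glob(str(item))))
        else:
            # a single file
            res.append(p)
    return res

# usage: a directory, a glob, and a list mixing both styles
with tempfile.TemporaryDirectory() as td:
    for name in ('a.json', 'b.json', 'c.txt'):
        (Path(td) / name).write_text('')
    from_dir  = [p.name for p in expand_paths(td)]
    from_glob = [p.name for p in expand_paths(f'{td}/*.json')]
    from_list = [p.name for p in expand_paths([Path(td) / 'c.txt', f'{td}/*.json'])]
```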
|
||||
|
||||
|
||||
Modules:
|
||||
|
||||
#+begin_src python :dir .. :results output drawer :exports result
# TODO ugh, pkgutil.walk_packages doesn't recurse and find packages like my.twitter.archive??
import importlib
# from lint import all_modules # meh
# TODO figure out how to discover configs automatically...
modules = [
    ('google' , 'my.google.takeout.paths'),
    ('reddit' , 'my.reddit'              ),
    ('twint'  , 'my.twitter.twint'       ),
    ('twitter', 'my.twitter.archive'     ),
]

def indent(s, spaces=4):
    return ''.join(' ' * spaces + l for l in s.splitlines(keepends=True))

from pathlib import Path
import inspect
from dataclasses import fields
import re
print('\n') # ugh. hack for org-ruby drawers bug
for cls, p in modules:
    m = importlib.import_module(p)
    C = getattr(m, cls)
    src = inspect.getsource(C)
    i = src.find('@property')
    if i != -1:
        src = src[:i]
    src = src.strip()
    src = re.sub(r'(class \w+)\(.*', r'\1:', src)
    mpath = p.replace('.', '/')
    for x in ['.py', '__init__.py']:
        if Path(mpath + x).exists():
            mpath = mpath + x
    print(f'- [[file:../{mpath}][{p}]]')
    mdoc = m.__doc__
    if mdoc is not None:
        print(indent(mdoc))
    print(f' #+begin_src python')
    print(indent(src))
    print(f' #+end_src')
#+end_src
|
||||
|
||||
#+RESULTS:
|
||||
:results:
|
||||
|
||||
|
||||
- [[file:../my/google/takeout/paths.py][my.google.takeout.paths]]
|
||||
|
||||
Module for locating and accessing [[https://takeout.google.com][Google Takeout]] data
|
||||
|
||||
#+begin_src python
|
||||
class google:
|
||||
takeout_path: Paths # path/paths/glob for the takeout zips
|
||||
#+end_src
|
||||
- [[file:../my/reddit.py][my.reddit]]
|
||||
|
||||
Reddit data: saved items/comments/upvotes/etc.
|
||||
|
||||
Uses [[https://github.com/karlicoss/rexport][rexport]] output.
|
||||
|
||||
#+begin_src python
|
||||
class reddit:
|
||||
export_path: Paths # path[s]/glob to the exported data
|
||||
rexport : Optional[PathIsh] = None # path to a local clone of rexport
|
||||
#+end_src
|
||||
- [[file:../my/twitter/twint.py][my.twitter.twint]]
|
||||
|
||||
Twitter data (tweets and favorites).
|
||||
|
||||
Uses [[https://github.com/twintproject/twint][Twint]] data export.
|
||||
|
||||
#+begin_src python
|
||||
class twint:
|
||||
export_path: Paths # path[s]/glob to the twint Sqlite database
|
||||
#+end_src
|
||||
- [[file:../my/twitter/archive.py][my.twitter.archive]]
|
||||
|
||||
Twitter data (uses [[https://help.twitter.com/en/managing-your-account/how-to-download-your-twitter-archive][official twitter archive export]])
|
||||
|
||||
#+begin_src python
|
||||
class twitter:
|
||||
export_path: Paths # path[s]/glob to the twitter archive takeout
|
||||
#+end_src
|
||||
:end:
|
|
@ -73,6 +73,9 @@ They aren't necessary, but improve your experience. At the moment these are:
|
|||
This is an *optional step* as some modules might work without extra setup.
|
||||
But it depends on the specific module.
|
||||
|
||||
You might also find it interesting to read [[file:CONFIGURING.org][CONFIGURING]], where I'm
|
||||
elaborating on some rationales behind the current configuration system.
|
||||
|
||||
** private configuration (=my.config=)
|
||||
# TODO write about dynamic configuration
|
||||
# TODO add a command to edit config?? e.g. HPI config edit
|
||||
|
@ -103,12 +106,15 @@ Since it's a Python package, generally it's very *flexible* and there are many w
|
|||
username = 'karlicoss'
|
||||
|
||||
#+end_src
|
||||
|
||||
I'm [[https://github.com/karlicoss/HPI/issues/12][working]] on improving the documentation for configuring the individual modules,
|
||||
but in the meantime the easiest is perhaps to skim through the code of the module and see what config attributes it's using.
|
||||
|
||||
For example, if you search for =config.= in [[file:../my/emfit/__init__.py][emfit module]], you'll see that it's using =export_path=, =tz=, =excluded_sids= and =cache_path=.
|
||||
Or you can just try running them and fill in the attributes Python complains about.
|
||||
To find out which attributes you need to specify:
|
||||
|
||||
- check in [[file:MODULES.org][MODULES]]
|
||||
- if there is nothing there, the easiest is perhaps to skim through the code of the module and to search for =config.= uses.
|
||||
|
||||
For example, if you search for =config.= in [[file:../my/emfit/__init__.py][emfit module]], you'll see that it's using =export_path=, =tz=, =excluded_sids= and =cache_path=.
|
||||
|
||||
- or you can just try running them and fill in the attributes Python complains about!
|
||||
|
||||
- My config layout is a bit more complicated:
|
||||
|
||||
|
|
|
@ -16,11 +16,14 @@ After that, you can set config attributes:
|
|||
import my.config as config
|
||||
|
||||
|
||||
def set_repo(name: str, repo):
|
||||
from pathlib import Path
|
||||
from typing import Union
|
||||
def set_repo(name: str, repo: Union[Path, str]) -> None:
|
||||
from .core.init import assign_module
|
||||
from . common import import_from
|
||||
|
||||
module = import_from(repo, name)
|
||||
r = Path(repo)
|
||||
module = import_from(r.parent, name)
|
||||
assign_module('my.config.repos', name, module)
|
||||
|
||||
|
||||
|
|
18
my/core/cfg.py
Normal file
|
@ -0,0 +1,18 @@
|
|||
from typing import TypeVar, Type, Callable, Dict, Any

Attrs = Dict[str, Any]

C = TypeVar('C')

# todo not sure about it, could be overthinking...
# but short enough to change later
def make_config(cls: Type[C], migration: Callable[[Attrs], Attrs]=lambda x: x) -> C:
    props = dict(vars(cls.__base__))
    props = migration(props)
    from dataclasses import fields
    params = {
        k: v
        for k, v in props.items()
        if k in {f.name for f in fields(cls)}
    }
    return cls(**params) # type: ignore[call-arg]
|
|
@ -195,3 +195,27 @@ def fastermime(path: PathIsh) -> str:
|
|||
|
||||
|
||||
Json = Dict[str, Any]
|
||||
|
||||
|
||||
from typing import TypeVar, Callable, Generic
|
||||
|
||||
_C = TypeVar('_C')
|
||||
_R = TypeVar('_R')
|
||||
|
||||
# https://stackoverflow.com/a/5192374/706389
|
||||
class classproperty(Generic[_R]):
|
||||
def __init__(self, f: Callable[[_C], _R]) -> None:
|
||||
self.f = f
|
||||
|
||||
def __get__(self, obj: None, cls: _C) -> _R:
|
||||
return self.f(cls)
|
||||
|
||||
|
||||
# hmm, this doesn't really work with mypy well..
|
||||
# https://github.com/python/mypy/issues/6244
|
||||
# class staticproperty(Generic[_R]):
|
||||
# def __init__(self, f: Callable[[], _R]) -> None:
|
||||
# self.f = f
|
||||
#
|
||||
# def __get__(self) -> _R:
|
||||
# return self.f()
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
from functools import lru_cache
|
||||
from datetime import datetime
|
||||
from datetime import datetime, tzinfo
|
||||
|
||||
import pytz # type: ignore
|
||||
|
||||
|
@ -11,6 +11,7 @@ tz_lookup = {
|
|||
tz_lookup['UTC'] = pytz.utc # ugh. otherwise it'z Zulu...
|
||||
|
||||
|
||||
# TODO dammit, lru_cache interferes with mypy?
|
||||
@lru_cache(None)
|
||||
def abbr_to_timezone(abbr: str):
|
||||
def abbr_to_timezone(abbr: str) -> tzinfo:
|
||||
return tz_lookup[abbr]
|
||||
|
|
|
@ -1,10 +1,27 @@
|
|||
'''
|
||||
Module for locating and accessing [[https://takeout.google.com][Google Takeout]] data
|
||||
'''
|
||||
|
||||
from dataclasses import dataclass
|
||||
from ...core.common import Paths
|
||||
|
||||
from my.config import google as user_config
|
||||
@dataclass
|
||||
class google(user_config):
|
||||
takeout_path: Paths # path/paths/glob for the takeout zips
|
||||
###
|
||||
|
||||
# TODO rename 'google' to 'takeout'? not sure
|
||||
|
||||
from ...core.cfg import make_config
|
||||
config = make_config(google)
|
||||
|
||||
from pathlib import Path
|
||||
from typing import Optional, Iterable
|
||||
|
||||
from ...common import get_files
|
||||
from ...kython.kompress import kopen, kexists
|
||||
|
||||
from my.config import google as config
|
||||
|
||||
def get_takeouts(*, path: Optional[str]=None) -> Iterable[Path]:
|
||||
"""
|
||||
|
|
74
my/reddit.py
|
@ -1,26 +1,74 @@
|
|||
"""
|
||||
Reddit data: saved items/comments/upvotes/etc.
|
||||
|
||||
Uses [[https://github.com/karlicoss/rexport][rexport]] output.
|
||||
"""
|
||||
from pathlib import Path
|
||||
|
||||
from typing import Optional
|
||||
from .core.common import Paths, PathIsh
|
||||
|
||||
from types import ModuleType
|
||||
from my.config import reddit as uconfig
|
||||
from dataclasses import dataclass
|
||||
|
||||
@dataclass
|
||||
class reddit(uconfig):
|
||||
export_path: Paths # path[s]/glob to the exported data
|
||||
rexport : Optional[PathIsh] = None # path to a local clone of rexport
|
||||
|
||||
@property
|
||||
def rexport_module(self) -> ModuleType:
|
||||
# todo return Type[rexport]??
|
||||
# todo ModuleIsh?
|
||||
rpath = self.rexport
|
||||
if rpath is not None:
|
||||
from my.cfg import set_repo
|
||||
set_repo('rexport', rpath)
|
||||
|
||||
import my.config.repos.rexport.dal as m
|
||||
return m
|
||||
|
||||
|
||||
from .core.cfg import make_config, Attrs
|
||||
# hmm, also nice thing about this is that migration is possible to test without the rest of the config?
|
||||
def migration(attrs: Attrs) -> Attrs:
|
||||
if 'export_dir' in attrs: # legacy name
|
||||
attrs['export_path'] = attrs['export_dir']
|
||||
return attrs
|
||||
config = make_config(reddit, migration=migration)
|
||||
|
||||
###
|
||||
# TODO not sure about the laziness...
|
||||
|
||||
from typing import TYPE_CHECKING
|
||||
if TYPE_CHECKING:
|
||||
# TODO not sure what is the right way to handle this..
|
||||
import my.config.repos.rexport.dal as rexport
|
||||
else:
|
||||
# TODO ugh. this would import too early
|
||||
# but on the other hand we do want to bring the objects into the scope for easier imports, etc. ugh!
|
||||
# ok, fair enough I suppose. It makes sense to configure something before using it. can always figure it out later..
|
||||
# maybe, the config could dynamically detect change and reimport itself? dunno.
|
||||
rexport = config.rexport_module
|
||||
###
|
||||
|
||||
|
||||
from typing import List, Sequence, Mapping, Iterator
|
||||
from .core.common import mcachew, get_files, LazyLogger, make_dict
|
||||
|
||||
|
||||
logger = LazyLogger(__name__, level='debug')
|
||||
|
||||
|
||||
from pathlib import Path
|
||||
from .kython.kompress import CPath
|
||||
from .common import mcachew, get_files, LazyLogger, make_dict
|
||||
|
||||
from my.config import reddit as config
|
||||
import my.config.repos.rexport.dal as rexport
|
||||
|
||||
|
||||
def inputs() -> Sequence[Path]:
|
||||
# TODO rename to export_path?
|
||||
files = get_files(config.export_dir)
|
||||
files = get_files(config.export_path)
|
||||
# TODO Cpath better be automatic by get_files...
|
||||
res = list(map(CPath, files)); assert len(res) > 0
|
||||
# todo move the assert to get_files?
|
||||
return tuple(res)
|
||||
|
||||
logger = LazyLogger(__name__, level='debug')
|
||||
|
||||
|
||||
Sid = rexport.Sid
|
||||
Save = rexport.Save
|
||||
|
@ -64,10 +112,6 @@ from multiprocessing import Pool
|
|||
|
||||
# TODO hmm. apparently decompressing takes quite a bit of time...
|
||||
|
||||
def reddit(suffix: str) -> str:
|
||||
return 'https://reddit.com' + suffix
|
||||
|
||||
|
||||
class SaveWithDt(NamedTuple):
|
||||
save: Save
|
||||
backup_dt: datetime
|
||||
|
|
|
@ -1,6 +1,22 @@
|
|||
"""
|
||||
Twitter data (uses [[https://help.twitter.com/en/managing-your-account/how-to-download-your-twitter-archive][official twitter archive export]])
|
||||
"""
|
||||
from dataclasses import dataclass
|
||||
from ..core.common import Paths
|
||||
|
||||
from my.config import twitter as user_config
|
||||
|
||||
@dataclass
|
||||
class twitter(user_config):
|
||||
export_path: Paths # path[s]/glob to the twitter archive takeout
|
||||
|
||||
|
||||
###
|
||||
|
||||
from ..core.cfg import make_config
|
||||
config = make_config(twitter)
|
||||
|
||||
|
||||
from datetime import datetime
|
||||
from typing import Union, List, Dict, Set, Optional, Iterator, Any, NamedTuple
|
||||
from pathlib import Path
|
||||
|
@ -13,14 +29,13 @@ import pytz
|
|||
from ..common import PathIsh, get_files, LazyLogger, Json
|
||||
from ..kython import kompress
|
||||
|
||||
from my.config import twitter as config
|
||||
|
||||
|
||||
logger = LazyLogger(__name__)
|
||||
|
||||
|
||||
def _get_export() -> Path:
|
||||
return max(get_files(config.export_path, '*.zip'))
|
||||
return max(get_files(config.export_path))
|
||||
|
||||
|
||||
Tid = str
|
||||
|
|
|
@ -1,24 +1,34 @@
|
|||
"""
|
||||
Twitter data (tweets and favorites). Uses [[https://github.com/twintproject/twint][Twint]] data export.
|
||||
Twitter data (tweets and favorites).
|
||||
|
||||
Uses [[https://github.com/twintproject/twint][Twint]] data export.
|
||||
"""
|
||||
|
||||
from ..core.common import Paths
|
||||
from dataclasses import dataclass
|
||||
from my.config import twint as user_config
|
||||
|
||||
@dataclass
|
||||
class twint(user_config):
|
||||
export_path: Paths # path[s]/glob to the twint Sqlite database
|
||||
|
||||
|
||||
from ..core.cfg import make_config
|
||||
config = make_config(twint)
|
||||
|
||||
|
||||
from datetime import datetime
|
||||
from typing import NamedTuple, Iterable, List
|
||||
from pathlib import Path
|
||||
|
||||
from ..common import PathIsh, get_files, LazyLogger, Json
|
||||
from ..core.common import get_files, LazyLogger, Json
|
||||
from ..core.time import abbr_to_timezone
|
||||
|
||||
from my.config import twint as config
|
||||
|
||||
|
||||
log = LazyLogger(__name__)
|
||||
|
||||
|
||||
def get_db_path() -> Path:
|
||||
# TODO don't like the hardcoded extension. maybe, config should decide?
|
||||
# or, glob only applies to directories?
|
||||
return max(get_files(config.export_path, glob='*.db'))
|
||||
return max(get_files(config.export_path))
|
||||
|
||||
|
||||
class Tweet(NamedTuple):
|
||||
|
|
|
@ -55,8 +55,7 @@ DAL = None
|
|||
''')
|
||||
|
||||
from my.cfg import set_repo
|
||||
# FIXME meh. hot sure about setting the parent??
|
||||
set_repo('hypexport', tmp_path)
|
||||
set_repo('hypexport', fake_hypexport)
|
||||
|
||||
# should succeed now!
|
||||
import my.hypothesis
|
||||
|
|
|
@ -1,16 +1,17 @@
|
|||
from datetime import datetime
|
||||
import pytz
|
||||
|
||||
from my.reddit import events, inputs, saved
|
||||
from my.common import make_dict
|
||||
|
||||
|
||||
def test() -> None:
|
||||
from my.reddit import events, inputs, saved
|
||||
list(events())
|
||||
list(saved())
|
||||
|
||||
|
||||
def test_unfav() -> None:
|
||||
from my.reddit import events, inputs, saved
|
||||
ev = events()
|
||||
url = 'https://reddit.com/r/QuantifiedSelf/comments/acxy1v/personal_dashboard/'
|
||||
uev = [e for e in ev if e.url == url]
|
||||
|
@ -23,6 +24,7 @@ def test_unfav() -> None:
|
|||
|
||||
|
||||
def test_saves() -> None:
|
||||
from my.reddit import events, inputs, saved
|
||||
# TODO not sure if this is necesasry anymore?
|
||||
saves = list(saved())
|
||||
# just check that they are unique..
|
||||
|
@ -30,6 +32,7 @@ def test_saves() -> None:
|
|||
|
||||
|
||||
def test_disappearing() -> None:
|
||||
from my.reddit import events, inputs, saved
|
||||
# eh. so for instance, 'metro line colors' is missing from reddit-20190402005024.json for no reason
|
||||
# but I guess it was just a short glitch... so whatever
|
||||
saves = events()
|
||||
|
@ -39,12 +42,18 @@ def test_disappearing() -> None:
|
|||
|
||||
|
||||
def test_unfavorite() -> None:
|
||||
from my.reddit import events, inputs, saved
|
||||
evs = events()
|
||||
unfavs = [s for s in evs if s.text == 'unfavorited']
|
||||
[xxx] = [u for u in unfavs if u.eid == 'unf-19ifop']
|
||||
assert xxx.dt == datetime(2019, 1, 28, 8, 10, 20, tzinfo=pytz.utc)
|
||||
|
||||
|
||||
def test_extra_attr() -> None:
|
||||
from my.reddit import config
|
||||
assert isinstance(getattr(config, 'passthrough'), str)
|
||||
|
||||
|
||||
import pytest # type: ignore
|
||||
@pytest.fixture(autouse=True, scope='module')
|
||||
def prepare():
|
||||
|
@ -55,3 +64,5 @@ def prepare():
|
|||
# first bit is for 'test_unfavorite, the second is for test_disappearing
|
||||
files = files[300:330] + files[500:520]
|
||||
config.export_dir = files # type: ignore
|
||||
|
||||
setattr(config, 'passthrough', "isn't handled, but available dynamically nevertheless")
|
||||
|
|