Compare commits

712 commits

Author SHA1 Message Date
Dima Gerasimov
bb703c8c6a twitter.android: fix get_own_user_id for latest exports 2024-12-29 15:48:15 +00:00
Dima Gerasimov
54df429f61 core.sqlite: add helper SqliteTool to get table schemas 2024-12-29 15:16:03 +00:00
purarue
f1d23c5e96 smscalls: allow large XML files as input
once XML files increase past a certain size
(was about 220MB for me), the parser just throws
an error because the tree is too large (iirc for
security reasons)

could maybe look at using iterparse in the future
to parse it without loading the whole file, but this
seems to fix it fine for me
2024-12-28 21:46:28 +00:00
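A minimal sketch of the kind of fix described above, assuming lxml (which smscalls declares as a requirement later in this log) -- huge_tree lifts the parser's default tree-size safeguard:

```
from pathlib import Path

from lxml import etree

def parse_large_export(path: Path) -> etree._Element:
    # lxml limits tree depth/size by default as a safety measure;
    # huge_tree=True lifts that limit for trusted local exports
    parser = etree.XMLParser(huge_tree=True)
    return etree.parse(str(path), parser=parser).getroot()
```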
purarue
d8c53bde34 smscalls: add phone number to model 2024-11-26 21:53:52 +00:00
purarue
95a16b956f
doc: some performance notes for query_range (#409)
* doc: some performance notes for query_range
* add ruff_cache to gitignore
2024-11-26 21:53:10 +00:00
purarue
a7f05c2cad doc: spelling fixes 2024-11-26 21:51:40 +00:00
Srajan Garg
ad55c5c345
fix typo in rexport DAL (#405)
* fix typo in rexport DAL
2024-11-13 00:05:27 +00:00
purarue
7ab6f0d5cb chore: update urls 2024-10-30 20:12:00 +00:00
Dima Gerasimov
a2b397ec4a my.whatsapp.android: adapt to new db format 2024-10-22 21:35:52 +01:00
Dima Gerasimov
8496d131e7 general: migrate modules to use 3.9 features 2024-10-19 23:41:22 +01:00
karlicoss
d3f9a8e8b6
core: migrate code to benefit from 3.9 stuff (#401)
for now keeping ruff on 3.8 target version, need to sort out modules as well
2024-10-19 20:55:09 +01:00
Dima Gerasimov
bc7c3ac253 general: python3.8 reached EOL, switch min version
also enable 3.13 on CI
2024-10-19 18:58:17 +01:00
Dima Gerasimov
a8f86e32b9 core.time: hotfix for default force_abbreviations attribute 2024-09-23 22:04:41 +01:00
Dima Gerasimov
6a6d157040 cli: fix minor race condition in creating hpi_temp_dir 2024-09-23 01:22:16 +01:00
Dima Gerasimov
bf8af6c598 tox: try using uv for CI, should result in speedup
see https://github.com/karlicoss/HPI/issues/391
2024-09-23 01:22:16 +01:00
Dima Gerasimov
8ed9e1947e my.youtube.takeout: deduplicate watched videos and sort out a few minor errors 2024-09-22 23:46:41 +01:00
Dima Gerasimov
75639a3d5e tox: some prep for potentially using uv on CI instead of pip
see https://github.com/karlicoss/HPI/issues/391
2024-09-22 20:10:52 +01:00
Dima Gerasimov
3166109f15 my.core: fix list constructor in always_support_sequence and add some tests 2024-09-22 04:35:30 +01:00
Dima Gerasimov
02dabe9f2b my.twitter.archive: cleanup linting and use proper configuration via abstract class 2024-09-22 02:13:10 +01:00
Dima Gerasimov
239e6617fe my.twitter.archive: deduplicate tweets based on id_str/created_at and raw tweet text 2024-09-22 02:13:10 +01:00
Dima Gerasimov
e036cc9e85 my.twitter.android: get own user id as string, consistent with rest of module 2024-09-22 02:13:10 +01:00
Dima Gerasimov
2ca323da84 my.fbmessenger.android: exclude unsent messages to avoid duplication 2024-09-21 23:25:25 +01:00
Dima Gerasimov
6a18f47c37 my.github.gdpr/my.zulip.organization: use kompress support for tar.gz if it's available
otherwise fall back onto unpacking into tmp dir via my.core.structure
2024-09-18 23:35:03 +01:00
Dima Gerasimov
201ddd4d7c my.core.structure: add support for .tar.gz archives
this will be useful to migrate .tar.gz processing to kompress in a backwards compatible way, or to run them against unpacked folder structure if user prefers
2024-09-17 00:25:17 +01:00
Dima Gerasimov
27178c0939 my.google.takeout.parser: speedup event merging on newer google_takeout_parser versions 2024-09-13 02:31:12 +01:00
Dima Gerasimov
71fdeca5e1 ci: update mypy config and make ruff config more consistent with other projects 2024-08-31 02:17:49 +01:00
Dima Gerasimov
d58453410c ruff: process remaining existing checks and suppress the annoying ones 2024-08-28 04:06:32 +01:00
Dima Gerasimov
1c5efc46aa ruff: enable TRY rules 2024-08-28 04:06:32 +01:00
Dima Gerasimov
affa79ba3a my.time.tz.via_location: fix accidental RuntimeError introduced in previous MR 2024-08-28 04:06:32 +01:00
Dima Gerasimov
fc0e0be291 ruff: enable ICN and PD rules 2024-08-28 04:06:32 +01:00
Dima Gerasimov
c5df3ce128 ruff: enable W, COM, EXE rules 2024-08-28 04:06:32 +01:00
Dima Gerasimov
ac08af7aab ruff: enable PT (pytest) rules 2024-08-28 04:06:32 +01:00
Dima Gerasimov
9fd4227abf ruff: enable RET/PIE/PLW 2024-08-28 04:06:32 +01:00
Dima Gerasimov
bd1e5d2f11 ruff: enable PERF checks set 2024-08-28 04:06:32 +01:00
Dima Gerasimov
985c0f94e6 ruff: attempt to enable ARG checks, suppress in some places 2024-08-28 04:06:32 +01:00
Dima Gerasimov
72cc8ff3ac ruff: enable B warnings (mainly suppressed exceptions and unused variables) 2024-08-28 04:06:32 +01:00
Dima Gerasimov
d0df8e8f2d ruff: enable PLR rules and fix bug in my.github.gdpr._is_bot 2024-08-28 04:06:32 +01:00
Dima Gerasimov
b594377a59 ruff: enable RUF ruleset 2024-08-28 04:06:32 +01:00
Dima Gerasimov
664c40e3e8 ruff: enable FBT rules to detect boolean arguments use without kwargs 2024-08-28 04:06:32 +01:00
Dima Gerasimov
118c2d4484 ruff: enable UP ruleset for detecting python deprecations 2024-08-28 04:06:32 +01:00
Dima Gerasimov
d244c7cc4e ruff: enable and fix C4 ruleset 2024-08-28 04:06:32 +01:00
Dima Gerasimov
c08ddbc781 general: small updates for typing while trying out pyright 2024-08-28 04:06:32 +01:00
Dima Gerasimov
b1fe23b8d0 my.rss.feedly/my.twitter.talon -- migrate to use lazy user configs 2024-08-26 04:00:58 +01:00
Dima Gerasimov
b87d1c970a tests: move remaining tests from tests/ to my.tests, cleanup corresponding modules 2024-08-26 04:00:58 +01:00
Dima Gerasimov
a5643206a0 general: make time.tz.via_location user config lazy, move tests to my.tests package
also gets rid of the problematic reset_modules thingie
2024-08-26 04:00:58 +01:00
Dima Gerasimov
270080bd56 core.error: better defensive handling for my.core.source when parts of config are missing 2024-08-26 04:00:58 +01:00
Dima Gerasimov
094519acaf tests: disable cachew in my.tests subpackage 2024-08-26 04:00:58 +01:00
Dima Gerasimov
7cae9d5bf3 my.google.takeout.paths: migrate to new style lazy config
also clean up tests a little and move into my.tests.location.google
2024-08-26 04:00:58 +01:00
Dima Gerasimov
2ff2dcfc00 tests: move test checking for my_config handling to core/tests/test_config.py
allows removing the hacky reset_modules thing from the setup fixture
2024-08-25 20:49:56 +01:00
Dima Gerasimov
1215181af5 core: move stuff from tests/demo.py to my/core/tests/test_config.py
also clean all this up a bit
2024-08-25 20:49:56 +01:00
Dima Gerasimov
5a67f0bafe pdfs: migrate config to Protocol with properties
allows removing a whole bunch of hacky crap from tests!
2024-08-25 20:49:56 +01:00
Dima Gerasimov
d154825591 my.bluemaestro: make config construction lazy
following the discussions here: https://github.com/karlicoss/HPI/issues/46#issuecomment-2295464073
2024-08-25 20:49:56 +01:00
Dima Gerasimov
9f017fb29b my.core.pandas: add more tests 2024-08-20 00:15:15 +01:00
karlicoss
5ec357915b core.common: add test for classproperty 2024-08-17 13:05:56 +01:00
karlicoss
245ad22057 core.common: bring back asdict backwards compat -- was used in orger 2024-08-17 13:05:56 +01:00
Dima Gerasimov
7bfce72b7c core: cleanup/sort imports according to ruff check --select I 2024-08-16 11:38:13 +01:00
Dima Gerasimov
7023088d13 core.common: deprecate outdated LazyLogger alias 2024-08-16 10:22:29 +01:00
Dima Gerasimov
614c929f95 core.common: move Json, datetime_aware, datetime_naive, is_namedtuple, asdict to my.core.types 2024-08-16 10:22:29 +01:00
Dima Gerasimov
2b0f92c883 my.core: deprecate Path/dataclass imports from my.core during type checking
runtime still works for backwards compatibility
2024-08-16 10:22:29 +01:00
Dima Gerasimov
7f8a502310 core.common: move assert_subpackage to my.core.internal 2024-08-16 10:22:29 +01:00
Dima Gerasimov
88f3c17c27 core.common: move mime-related stuff to my.core.mime
no backward compat, unlikely it was used by anyone else
2024-08-16 10:22:29 +01:00
Dima Gerasimov
c45c51af22 core.common: move stats-related stuff to my.core.stats and add more thorough tests/docs
deprecate core.common.stat and core.common.Stats with backwards compatibility
2024-08-16 10:22:29 +01:00
Dima Gerasimov
18529257e7 core.common: move DummyExecutor to core.common.utils.concurrent
without backwards compat, unlikely it's been used by anyone
2024-08-16 10:22:29 +01:00
Dima Gerasimov
bcc4c15304 core: cleanup my.core.common.unique_everseen
- move to my.core.utils.itertools
- more robust check for hashable types -- now checks at runtime (since a purely type-based check isn't necessarily sound)
- add more testing
2024-08-16 10:22:29 +01:00
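A sketch of the runtime hashability check described above (names illustrative, not the module's exact code):

```
from collections.abc import Hashable, Iterable, Iterator
from typing import TypeVar

T = TypeVar('T')

def unique_everseen(iterable: Iterable[T]) -> Iterator[T]:
    seen = set()
    for item in iterable:
        # a purely type-based check isn't sound: e.g. a dataclass with
        # eq=True (and frozen=False) sets __hash__ = None at runtime
        if not isinstance(item, Hashable):
            raise TypeError(f'not hashable: {item!r}')
        if item not in seen:
            seen.add(item)
            yield item
```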
Dima Gerasimov
06084a8787 my.core.common: move warn_if_empty to my.core.utils.itertools, cleanup and add more tests 2024-08-16 10:22:29 +01:00
Dima Gerasimov
770dba5506 core.common: move away import related stuff to my.core.utils.imports
moving without backward compatibility, since it's extremely unlikely they are used by any external modules

in fact, unclear if these methods still have much value at all, but keeping for now just in case
2024-08-16 10:22:29 +01:00
Dima Gerasimov
66c08a6c80 core.common: move listify to core.utils.itertools, use better typing annotations for it
also some minor refactoring of my.rss
2024-08-16 10:22:29 +01:00
Dima Gerasimov
c64d7f5b67 core: cleanup itertool style helpers
- deprecate group_by_key, should use more_itertools.bucket instead
- move make_dict and ensure_unique to my.core.utils.itertools
2024-08-16 10:22:29 +01:00
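For the deprecation above, more_itertools.bucket covers the group_by_key use case lazily; a small usage example:

```
from more_itertools import bucket

events = [('reddit', 1), ('github', 2), ('reddit', 3)]
grouped = bucket(events, key=lambda e: e[0])  # lazy, unlike building a dict of lists
assert list(grouped['reddit']) == [('reddit', 1), ('reddit', 3)]
```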
Dima Gerasimov
973c4205df core: cleanup deprecations, exclude from type checking and show runtime warnings
among affected things:

- core.common.assert_never
- core.common.cproperty
- core.common.isoparse
- core.common.mcachew
- core.common.the
- core.common.tzdatetime
- core.compat.sqlite_backup
2024-08-16 10:22:29 +01:00
Dima Gerasimov
a7439c7846 general: move assert_never to my.core.compat as it's in stdlib from 3.11
rely on typing-extensions for fallback

introducing typing-extensions dependency without fallback, should be ok since it's in the top 10 of popular packages
2024-08-16 10:22:29 +01:00
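The compat shim described above follows the standard version-gated import pattern; a minimal sketch:

```
import sys

if sys.version_info[:2] >= (3, 11):
    from typing import assert_never  # in stdlib since 3.11
else:
    from typing_extensions import assert_never  # fallback dependency
```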
Dima Gerasimov
1317914bff general: add 'destructive parsing' (kinda what we were doing in my.core.konsume) to my.experimental
also some cleanup for my.codeforces and my.topcoder
2024-08-12 13:24:28 +01:00
Dima Gerasimov
1e1e8d8494 my.topcoder: get rid of kjson in favor of using builtin dict methods 2024-08-12 13:24:28 +01:00
Dima Gerasimov
069264ce52 core.common: get rid of deprecated utcfromtimestamp 2024-08-10 17:46:30 +01:00
Dima Gerasimov
c69a0b43ba my.vk.favorites: some minor cleanup 2024-08-10 17:46:30 +01:00
Dima Gerasimov
34593c032d tests: move more tests into core, more consistent tests running in tox 2024-08-07 01:08:39 +01:00
Dima Gerasimov
074e24c309 general: deprecate my.core.dataset and simplify tox file 2024-08-07 01:08:39 +01:00
Dima Gerasimov
fb8e9909a4 tests: simplify tests for my.core.serialize a bit and simplify tox file 2024-08-07 01:08:39 +01:00
Dima Gerasimov
3aebc573e8 tests: use updated conftest from pymplate, this allows running individual test modules properly
e.g. pytest --pyargs my.core.tests.test_get_files
2024-08-06 20:55:16 +01:00
Dima Gerasimov
b615ba10b1 ci: temporary suppress pandas mypy error in check_dateish 2024-08-05 23:35:24 +01:00
Dima Gerasimov
2c63fe25c0 my.twitter.android: get data from statuses table rather than timeline_view 2024-08-05 23:35:24 +01:00
Dima Gerasimov
652ee9b875 fbmessenger.android: fix minor issue with processing thread participants 2024-08-03 19:01:51 +01:00
Dima Gerasimov
9e72672b4f legacy google takeout: fix timezone localization 2024-08-03 16:50:09 +01:00
karlicoss
d5fccf1874 twitter.android: more comments on timeline types 2024-08-03 16:50:09 +01:00
Dima Gerasimov
0e6dd32afe ci: minor fixes after mypy update 2024-08-03 16:18:32 +01:00
Dima Gerasimov
c9c0e19543 my.instagram.gdpr: fix for new format 2024-08-03 16:18:32 +01:00
seanbreckenridge
35dd5d82a0
smscalls: parse mms from smscalls export (#370)
* initial mms exploration
2024-06-05 22:03:03 +01:00
Dima Gerasimov
8a8a1ebb0e my.tinder.android: better error handling and fix case with empty db 2024-04-03 20:13:40 +01:00
Dima Gerasimov
103ea2096e my.coding.commits: fix for git repo discovery after fdfind v9 2024-03-13 00:46:18 +00:00
Dima Gerasimov
751ed02f43 tests: pin pytest version to <8 for now, having some test collection errors
https://docs.pytest.org/en/stable/changelog.html#collection-changes
2024-03-13 00:46:18 +00:00
Dima Gerasimov
477b7e8fd3 docs: minor update to overlays docs 2024-03-13 00:46:18 +00:00
Dima Gerasimov
0f3d09915c ci: update actions versions 2024-03-13 00:46:18 +00:00
Dima Gerasimov
7236024c7a my.twitter.android: better detection of own user id 2024-03-13 00:46:18 +00:00
Dima Gerasimov
87a8a7781b my.google.maps: initial module for extracting places data from Android app 2024-01-01 23:46:02 +00:00
Sean Breckenridge
93e475795d google takeout: support multiple locales
uses the known locales in google_takeout_parser
to determine the expected paths for each locale,
and performs a partial match on the paths to
detect and use match_structure
2023-12-31 18:57:30 +00:00
Dima Gerasimov
1b187b2c1b whatsapp.android: expose all entities extracted from the db 2023-12-29 00:57:49 +00:00
Dima Gerasimov
3ec362fce9 fbmessenger.android: expose contacts 2023-12-28 18:13:16 +00:00
karlicoss
a0ce666024 my.youtube.takeout: fix exception handling 2023-12-28 00:25:05 +00:00
karlicoss
1c452b12d4 twitter.android: extract likes and own tweets as well 2023-12-28 00:12:39 +00:00
karlicoss
51209c547e my.twitter.android: refactor into a proper module
for now only extracting bookmarks, will use it for some time and see how it goes
2023-12-24 00:49:07 +00:00
karlicoss
a4a7bc41b9 my.twitter.android: extract entities 2023-12-24 00:49:07 +00:00
karlicoss
3d75abafe9 my.twitter.android: some initial work on parsing sqlite databases from the official Android app 2023-12-24 00:49:07 +00:00
Dima Gerasimov
a8f8858cb1 docs: document more experiments with overlays in docs 2023-12-22 02:54:36 +00:00
Dima Gerasimov
adbc0e73a2 docs: add note about directly checking overlays with mypy 2023-12-22 02:54:36 +00:00
Dima Gerasimov
84d835962d docs: some documentation/thoughts on properly implementing overlay packages 2023-12-20 02:51:27 +00:00
Sean Breckenridge
224ba521e3 gpslogger: catch broken xml file error 2023-12-20 02:41:52 +00:00
Dima Gerasimov
a843407e40 core/compat: move fromisoformat to .core.compat module 2023-11-19 23:45:08 +00:00
karlicoss
09e0f66892 tox: disable --parallel flag in hpi module install
It's been so flaky it ends up taking more time to merge stuff. See https://github.com/karlicoss/HPI/issues/306
2023-11-19 19:18:19 +00:00
Dima Gerasimov
bde43d6a7a my.body.sleep: massive speedup for average temperature calculation 2023-11-11 00:42:49 +00:00
karlicoss
37643c098f tox: remove cat coverage index from tox, it's not very useful anyway 2023-11-10 23:11:54 +00:00
karlicoss
7b1cec9326 codeforces/topcode: move to top level and check in ci 2023-11-10 23:11:54 +00:00
karlicoss
657ce08ac8 fix mypy issues after mypy/libraries updates 2023-11-10 22:59:09 +00:00
karlicoss
996169aa29 time.tz.via_location: more consistent behaviour wrt caching
previously it was possible for cachew to never properly initialize the cache if you only queried some dates in the past,
because we never made it to the end of _iter_tzs

also some minor cleanup
2023-11-10 22:59:09 +00:00
karlicoss
70bb9ed0c5 location.google_takeout_semantic: handle None visitConfidence 2023-11-10 02:10:30 +00:00
karlicoss
65c617ed94 my.emfit: add missing properties to fake data generator 2023-11-10 02:10:30 +00:00
karlicoss
ac5f71c68b my.jawbone: get rid of matplotlib import on top level 2023-11-10 02:10:30 +00:00
karlicoss
e547acfa59 general: update minimal cachew version
had quite a few useful fixes/performance optimizations since
2023-11-07 21:24:56 +00:00
karlicoss
33f8d867e2 my.browser.export: cleanup
- make logging INFO (default) -- otherwise it's too quiet during processing lots of databases
- can pass inputs cachew directly now
2023-11-07 21:24:56 +00:00
karlicoss
19353e996d my.hackernews.harmonic: use orjson + add __hash__ for Saved object
plus some minor cleanup
2023-11-07 01:03:57 +00:00
karlicoss
4ac3bbb101 my.bumble.android: fix message deduplication 2023-11-07 01:03:57 +00:00
karlicoss
5630621ec1 my.pinboard: some cleanup 2023-11-06 23:10:00 +00:00
karlicoss
7631f1f2e4 monzo.monzoexport: initial module 2023-11-02 00:47:13 +00:00
karlicoss
105928238f vk_messages_backup: some cleanup + switch to get_files 2023-11-02 00:43:10 +00:00
Dima Gerasimov
24da04f142 ci: fix wrong release command 2023-11-01 01:54:16 +00:00
karlicoss
71cb66df5f core: add helper for more_iterable to check that all types involved are hashable
Otherwise unique_everseen performance may degrade to quadratic rather than linear

For now hidden behind HPI_CHECK_UNIQUE_EVERSEEN flag

also switch some modules to use it
2023-10-31 01:02:17 +00:00
Dima Gerasimov
d6786084ca general: deprecate some old methods by hiding behind TYPE_CHECKING 2023-10-30 22:51:31 +00:00
karlicoss
79ce8e84ec fbmessenger.android: support processing msys database
seems that threads_db2 stopped updating some time ago, and msys contains all new data now
2023-10-30 02:54:22 +00:00
karlicoss
f28f68b14b general: enhance logging for various modules 2023-10-29 22:32:07 +00:00
karlicoss
ea195e3d17 general: improve logging during file processing in various modules 2023-10-29 01:01:30 +01:00
karlicoss
bd27bd4c24 docs: add documentation on logging during HPI module development 2023-10-29 00:50:22 +01:00
karlicoss
f668208bce my.stackexchange.stexport: small cleanup & stat improvements 2023-10-28 21:33:36 +01:00
Dima Gerasimov
6821fbc2fe core/config: implement a warning if config is imported from the dir other than MY_CONFIG
this should help with identifying setup issues
2023-10-28 20:56:07 +01:00
Dima Gerasimov
edea2c2e75 my.kobo: add highlights method to return Highlight objects iteratively
also minor cleanup
2023-10-28 20:06:54 +01:00
Dima Gerasimov
d88a1b9933 my.hypothesis: expose data as iterators instead of lists
also add an adapter to support migrating in backwards compatible manner
2023-10-28 20:06:54 +01:00
Dima Gerasimov
4f7c9b4a71 core: move split compat/legacy modules into hpi_compat and compat 2023-10-28 20:06:54 +01:00
karlicoss
70bf51a125 core/stats: exclude contextmanagers from guess_stats 2023-10-28 00:08:32 +01:00
karlicoss
fb2b3e07de my.emfit: cleanup and pass cpu pool 2023-10-27 23:52:03 +01:00
Dima Gerasimov
32aa87b3ec doctor: make compileall check a bit more defensive 2023-10-27 02:38:22 +01:00
karlicoss
3a25c9042c my.hackernews.dogsheep: use utc datetime + minor cleanup 2023-10-27 02:38:03 +01:00
karlicoss
bef0423b4f my.zulip.organization: use UTC timestamps, support custom archive names + some cleanup 2023-10-27 02:38:03 +01:00
karlicoss
a0910e798d core.logging: ignore CollapseLogsHandler if we're not attached to a terminal
otherwise fails at os.get_terminal_size
2023-10-25 02:42:52 +01:00
Dima Gerasimov
1f61e853c9 reddit.rexport: experiment with using optional cpu pool (used by all of HPI)
Enabled by the env variable, specifying how many cores to dedicate, e.g.

HPI_CPU_POOL=4 hpi query ...
2023-10-25 02:06:45 +01:00
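A rough sketch of how such an opt-in pool can be wired up (names here are illustrative, not necessarily the module's actual API):

```
import os
from concurrent.futures import ProcessPoolExecutor
from typing import Optional

_NUM_WORKERS = int(os.environ.get('HPI_CPU_POOL', '0'))

def get_cpu_pool() -> Optional[ProcessPoolExecutor]:
    # None when HPI_CPU_POOL is unset/zero, so callers fall back
    # to plain sequential processing
    if _NUM_WORKERS <= 0:
        return None
    return ProcessPoolExecutor(max_workers=_NUM_WORKERS)
```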
Dima Gerasimov
a5c04e789a twitter.archive: deduplicate results via json.dumps
this speeds up processing quite a bit, from 40s to 20s for me, plus removes tons of identical outputs

interestingly enough, using the raw object (without json.dumps) as the key brings unique_everseen to a crawl...
2023-10-24 01:54:30 +01:00
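A sketch of the dedup trick above: raw tweets are nested dicts, i.e. unhashable, so unique_everseen would fall back to a list of seen values and turn quadratic; a canonical JSON string restores O(1) set lookups:

```
import json
from more_itertools import unique_everseen

def deduplicated(raw_tweets):
    # sort_keys makes the serialized form canonical, so equal tweets
    # always produce the same (hashable) key
    return unique_everseen(raw_tweets, key=lambda t: json.dumps(t, sort_keys=True))
```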
Dima Gerasimov
0e94e0a9ea whatsapp.android: handle most message types properly 2023-10-24 00:31:34 +01:00
Dima Gerasimov
72ab2603d5 my.whatsapp.android: exclude some dummy messages, minor cleanup 2023-10-24 00:31:34 +01:00
Dima Gerasimov
414b88178f tinder.android: infer user's own name automatically 2023-10-24 00:31:34 +01:00
Dima Gerasimov
f355a55e06 my.instagram.gdpr: process all historic archives + better normalising 2023-10-23 18:42:50 +01:00
Dima Gerasimov
f9a1050ceb my.instagram.android: more defensive error handling 2023-10-23 18:42:50 +01:00
karlicoss
86ea605aec core/stats: enable processing input files, report first and last filename
can be useful for quick investigation/testing setup
2023-10-22 00:47:36 +01:00
karlicoss
c335c0c9d8 core/stats: report datetime of first item in addition to last
quite useful for quickly determining time span of a data source
2023-10-22 00:47:36 +01:00
karlicoss
a60d69fb30 core/stats: get rid of duplicated keys for 'auto stats'
previously:
```
{'iter_data': {'iter_data': {'count': 9, 'last': datetime.datetime(2020, 1, 3, 1, 1, 1)}}}
```

after
```
{'iter_data': {'count': 9, 'last': datetime.datetime(2020, 1, 3, 1, 1, 1)}}
```
2023-10-22 00:47:36 +01:00
karlicoss
c5fe2e9412 core.stats: fix is_data_provider when from __future__ import annotations is used 2023-10-21 23:46:40 +01:00
karlicoss
872053a3c3 my.hackernews.harmonic: fix issue with crashing due to html escaping
also add proper logging
2023-10-21 23:46:40 +01:00
karlicoss
37bb33cdbc experimental: add a hacky helper to import "original/shadowed" modules from within overlays 2023-10-21 22:46:16 +01:00
karlicoss
8c2d1c9463 general: use less explicit kompress boilerplate in modules
now get_files/kompress library can handle it transparently
2023-10-20 21:13:59 +01:00
karlicoss
c63e80ce94 core: more consistent handling of zip archives in get_files + tests 2023-10-20 21:13:59 +01:00
Dima Gerasimov
9ffce1b696 reddit.rexport: add accessors for subreddits, multireddits and profile 2023-10-19 02:26:28 +01:00
Dima Gerasimov
29832a9f75 core: fix test_get_files after updating kompress 2023-10-19 02:26:28 +01:00
Dima Gerasimov
28d2450a21 reddit.rexport: some cleanup, move get_events stuff into personal overlay 2023-10-19 02:26:28 +01:00
karlicoss
fe26efaea8 core/kompress: move vendorized to _deprecated, use kompress library directly 2023-10-12 23:47:05 +01:00
karlicoss
bb478f369d core/logging: no need for super call in Filter 2023-10-12 23:47:05 +01:00
karlicoss
68289c1be3 general: fix ignores after mypy version update 2023-10-12 23:47:05 +01:00
Dima Gerasimov
0512488241 ci: sync configs to pymplate
- add python3.12
- add ruff
2023-10-06 02:24:01 +01:00
Dima Gerasimov
fabcbab751 fix mypy errors after version update 2023-10-02 01:27:49 +01:00
Dima Gerasimov
8cd74a9fc4 ci: attempt to use --parallel flag in tox 2023-10-02 01:27:49 +01:00
Sean Breckenridge
f3507613f0 location: make accuracy default config floats
previously they were ints which could possibly
break caching with cachew
2023-10-01 11:52:41 +01:00
Dima Gerasimov
8addd2d58a new module: Harmonic app for Hackernews 2023-09-25 16:36:21 +01:00
Dima Gerasimov
01480ec8eb core/logging: fix issue with logger setup called multiple times when called with different levels
should resolve https://github.com/karlicoss/HPI/issues/308
2023-09-19 22:39:52 +01:00
Sean Breckenridge
be81466871 browser: fix duplicate logs when fetching loglevel 2023-09-15 01:58:45 +01:00
Sean Breckenridge
2a46341ce2 my.core.logging: compatibility with HPI_LOGS
re-adds a removed check for HPI_LOGS, add some docs

fix the checks for browserexport/takeout logs to
use the computed level from my.core.logging
2023-09-07 02:36:26 +01:00
Sean Breckenridge
ff84d8fc88 core/cli: update vendored completion files
update required click version to 8.1
so we don't regenerate the vendored completions
incorrectly in the future
2023-09-07 00:01:27 +01:00
Dima Gerasimov
c283e542e3 general: fix some issues after mypy update 2023-08-24 23:46:23 +01:00
Dima Gerasimov
642e3b14d5 my.github.gdpr: some minor enhancements
- better error context
- handle some unknown files
- handle user=None in some cases
- cleanup imports
2023-08-24 23:46:23 +01:00
Dima Gerasimov
7ec894807f my.bumble.android: handle more msg types 2023-08-24 23:46:23 +01:00
Sean Breckenridge
fcaa7c1561 core/cli: allow user to bypass PEP 668
when installing dependencies with 'hpi module install',
this now lets a user pass '--break-system-packages' (or '-B'),
which passes the same option down to pip, to allow the user
to bypass PEP 668 and install packages that could possibly
conflict with system packages.
2023-08-10 01:41:43 +01:00
Dima Gerasimov
d6af4dec11 my.instagram.android: minor cleanup + cachew 2023-06-21 20:42:10 +01:00
Dima Gerasimov
88a3aa8d67 my.bluemaestro: minor cleanup 2023-06-21 20:42:10 +01:00
Dima Gerasimov
c25ab51664 core: some tweaks for better colour handling when we're redirecting stdout/stderr 2023-06-21 20:42:10 +01:00
Dima Gerasimov
6f6be5c78e my.hackernews.materialistic: process and merge all db exports + minor cleanup 2023-06-21 20:42:10 +01:00
Dima Gerasimov
dff31455f1 general: switch to make_logger in a few modules, use a bit more consistent logging, rely on default INFO level 2023-06-21 18:42:15 +01:00
Dima Gerasimov
661714f1d9 core/logging: overhaul and many improvements -- mainly to deprecate abandoned logzero
- generally saner/cleaner logger initialization

  In particular now it doesn't override logging level specified by the user code prior to instantiating the logger.

  Also remove the `LazyLogger` hack, doesn't seem like it's necessary when the above is implemented.

- get rid of `logzero` which is archived and abandoned now, use `colorlog` for coloured logging formatter

- allow configuring log level via shell via `LOGGING_LEVEL_module_name=<level>`

  E.g. `LOGGING_LEVEL_rescuexport_dal=WARNING LOGGING_LEVEL_my_rescuetime=debug ./script.py`

- port `AddExceptionTraceback` from HPI/promnesia

- port `CollapseLogsHandler` from HPI/promnesia

  Also allow configuring from the shell, e.g. `LOGGING_COLLAPSE=<level>`

- add support for `enlighten` progress bar, so it can be shared between different projects

  See https://github.com/Rockhopper-Technologies/enlighten#readme

  This allows nice CLI progressbars, e.g. for parallel processing of different files from HPI:

    ghexport.dal[111]  29%|████████████████████████████████████████████████████████████████▏              |  29/100 [00:03<00:07, 10.03 files/s]
    rexport.dal[comments]  17%|████████████████████████████████████▋                                      | 115/682 [00:03<00:14, 39.15 files/s]
    my.instagram.android   0%|▎                                                                           |    3/2631 [00:02<34:50, 1.26 files/s]

  Currently off by default, and hidden behind an env variable (`ENLIGHTEN_ENABLE=true`)
2023-06-21 18:42:15 +01:00
Dima Gerasimov
6aa3d4225e sort out mypy after its update 2023-06-21 03:32:46 +01:00
Dima Gerasimov
ab7135d42f core: experimental import of my._init_hook to configure logging/warnings/env variables 2023-06-21 03:32:46 +01:00
Dima Gerasimov
c12224af74 misc: replace uses of pytz.utc with timezone.utc where it makes sense 2023-06-09 03:31:13 +01:00
Dima Gerasimov
c91534b966 set json files to empty dicts so they are at least valid jsons
(promnesia was stumbling over these, seems like the easiest fix :) )
2023-06-09 03:31:13 +01:00
Dima Gerasimov
5fe21240b4 core: move mcachew into my.core.cachew; use better typing annotations (copied from cachew) 2023-06-08 01:29:49 +01:00
Dima Gerasimov
f8cd31044e general: move reddit tests into my/tests + tweak my.core.cfg to be more reliable 2023-05-26 00:58:23 +01:00
Dima Gerasimov
fcfc423a75 move some tests into the main HPI package 2023-05-26 00:03:24 +01:00
Dima Gerasimov
9594caa1cd general: move most core tests inside my.core.tests package
- distributes tests alongside the package, might be convenient for package users
- removes some weird indirection (e.g. dummy test files importing tests from modules)
- makes the command line for tests cleaner (e.g. no need to remember to manually add files to tox.ini)
- tests automatically covered by mypy (so makes mypy runs cleaner and ultimately better coverage)

The (vague) convention is

- tests/somemodule.py -- testing my.core.somemodule, contains tests directly re
- tests/test_something.py -- testing a specific feature, e.g. test_get_files.py tests the get_files method only
2023-05-25 00:25:13 +01:00
Dima Gerasimov
04d976f937 my/core/pandas tests: fix weird pytest error when constructing dataclass inside a def
can quickly reproduce by running pytest tests/tz.py tests/core/test_pandas.py
possibly will be resolved after fix in pytest?
see https://github.com/pytest-dev/pytest/issues/7856
2023-05-24 22:32:44 +01:00
Dima Gerasimov
a98bc6daca my.core.pandas: rely on typing annotations from types-pandas 2023-05-24 22:32:44 +01:00
Dima Gerasimov
fe88380499 general: switch to using native 3.8 versions for cached_property/Literal/Protocol instead of compat 2023-05-16 01:18:30 +01:00
Dima Gerasimov
c34656e8fb general: update mypy config, seems that logs of type: ignore aren't necessary anymore 2023-05-16 01:18:30 +01:00
Dima Gerasimov
a445d2cbfe general: python3.7 will reach EOL soon, remove its support 2023-05-16 01:18:30 +01:00
seanbreckenridge
7a32302d66
query: add --warn-exceptions, dateparser, docs (#290)
* query: add --warn-exceptions, dateparser, docs

added --warn-exceptions (like --raise-exceptions/--drop-exceptions, but
lets you pass a warn_func if you want to customize how the exceptions are
handled. By default this creates a logger in main and logs the exception

added dateparser as a fallback if its installed (it's not a strong dependency, but
I mentioned in the docs that it's useful for parsing dates/times)

added docs for query, and a few examples

--output gpx respects the --{drop,warn,raise}-exceptions flags, have
an example of that in the docs as well
2023-04-18 00:15:35 +01:00
Sean Breckenridge
82bc51d9fc smscalls: make checking for keys stricter
sort of reverts #287, but also makes some other improvements

this allows us to remove some of the Optional's to
make downstream consumers easier to write. However,
this keeps the return type as a Res (result, with errors),
so downstream consumers will have to handle those in case
the schema ever changes (highly unlikely)

also added the 'call_type/message_type' with a comment
there describing the values

I left 'who' Optional since I believe it actually should be -
it's very possible for there to be no contact name; added
a check in case it's '(Unknown)', which is what my phone
sets it to
2023-04-15 17:17:02 +01:00
seanbreckenridge
40de162fab
cli: add option to output locations to gpx files (#286)
* cli: add option to output locations to gpx files
2023-04-15 00:31:11 +01:00
Sean Breckenridge
02c738594f smscalls: make some fields optional, yield errors
reflects the new types-lxml package
https://github.com/abelcheung/types-lxml
2023-04-14 23:50:26 +01:00
Dima Gerasimov
d464b1e607 core: implement more methods for ZipPath and better support for get_files 2023-04-03 22:58:54 +01:00
Dima Gerasimov
0c5b2b4a09 my.whatsapp.android: initial module 2023-04-01 04:07:35 +01:00
Dima Gerasimov
8288032b1c my.telegram.telegram_backup: support optional extra_where and optional media info extraction for Promnesia 2023-03-27 03:27:13 +01:00
Dima Gerasimov
74710b339a telegram_backup: order messages by date and users/chats by id for determinism 2023-03-27 03:27:13 +01:00
Kian-Meng Ang
d2ef23fcb4 docs: fix typos
found via `codespell -L copie,datas,pres,fo,tooks,noo,ue,ket,frop`
2023-03-27 03:02:35 +01:00
Dima Gerasimov
919c84fb5a my.instagram: better unification of like messages/reactions 2023-03-27 02:16:17 +01:00
Dima Gerasimov
9aadbb504b my.instagram.android: properly extract our own user 2023-03-27 02:16:17 +01:00
Dima Gerasimov
8f7d14e7c6 my.instagram: somewhat mad merging mechanism to correlate gdpr and android exports 2023-03-27 02:16:17 +01:00
Dima Gerasimov
e7be680841 my.instagram.gdpr: handle missing message content defensively 2023-03-27 02:16:17 +01:00
Dima Gerasimov
347cd1ef77 my.fbmessenger: add Sender protocol for consistency 2023-03-17 00:33:22 +00:00
Dima Gerasimov
58d2e25a42 ci: suppress some mypy issues after upgrade 2023-03-17 00:33:22 +00:00
Dima Gerasimov
bef832cbff my.fbmessenger.export: remove legacy dump_chat_history code 2023-03-17 00:33:22 +00:00
Dima Gerasimov
0a05b27266 my.fbmessenger.android: set timezone to utc 2023-03-17 00:33:22 +00:00
Dima Gerasimov
457797bdfb my.bumble.android: better handling for missing conversation id in database 2023-03-17 00:33:22 +00:00
Dima Gerasimov
9db5f318fb my.twitter.twint: use dict row factory instead of sqlite Row
otherwise it's not json serializable
2023-03-17 00:33:22 +00:00
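The dict row factory mentioned above is the standard sqlite3 pattern; unlike sqlite3.Row, a plain dict can go straight to json.dumps (the db path is illustrative):

```
import sqlite3

def dict_factory(cursor: sqlite3.Cursor, row: tuple) -> dict:
    # cursor.description holds one 7-tuple per column, name first
    return {desc[0]: value for desc, value in zip(cursor.description, row)}

conn = sqlite3.connect('twint.db')
conn.row_factory = dict_factory
```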
seanbreckenridge
79eeab2128
cli completion doc updates, hide legacy import warning (#279)
* core/cli: hide warnings when autocompleting

* link to completion in setup/troubleshooting
* update completion docs to make source path clear
2023-03-06 21:36:36 +00:00
seanbreckenridge
9d231a8ea9
google_takeout: add semantic location history (#278)
* google_takeout: add semantic location history
2023-03-04 18:36:10 +00:00
Dima Gerasimov
a4c713664e core.logging: sync logging helper with Promnesia, adds more goodies
- print exception traceback by default when using logger.exception
- COLLAPSE_DEBUG_LOGS env variable
2023-03-03 21:14:11 +00:00
Dima Gerasimov
bee17d932b fbmessenger.android: use Optional name, best to leave it for the consumer to decide how to behave when it's unavailable
e.g. using <NAME UNAVAILABLE> was causing issues when used as a zulip contact name
2023-03-03 21:14:11 +00:00
Dima Gerasimov
4dfc4029c3 core.kompress: proper support for read_text/read_bytes against zstd/xz archives 2023-03-03 21:14:11 +00:00
Dima Gerasimov
b94904f5ee core.kompress: support .zst extension, seems more conventional than .zstd 2023-03-03 21:14:11 +00:00
Sean Breckenridge
db2cd00bed try removing parallel on mac to prevent CI failure 2023-02-28 20:55:12 +00:00
Sean Breckenridge
a70118645b my.ip.common: remove REQUIRES
no reason to have it there since it's
__NOT_HPI_MODULE__, so it's not discoverable anyway
2023-02-28 20:55:12 +00:00
Sean Breckenridge
f36bc6144b tox: use my.ip.all, sort hpi installs 2023-02-28 20:55:12 +00:00
Sean Breckenridge
435cb020f9 add example for denylist, update ci 2023-02-28 20:55:12 +00:00
seanbreckenridge
98b086f746
location fallback (#263)
see https://github.com/karlicoss/HPI/issues/262

* move home to fallback/via_home.py
* move via_ip to fallback
* add fallback model
* add stub via_ip file
* add fallback_locations for via_ip
* use protocol for locations
* estimate_from helper, via_home estimator, all.py
* via_home: add accuracy, cache history
* add datasources to gpslogger/google_takeout
* tz/via_location.py: update import to fallback
* denylist docs/installation instructions
* tz.via_location: let user customize cachew refresh time
* add via_ip.estimate_location using binary search
* use estimate_location in via_home.get_location
* tests: add gpslogger to location config stub
* tests: install tz related libs in test env
* tz: add regression test for broken windows dates

* vendorize bisect_left from python src
it doesn't have a 'key' parameter until python3.10 (see the sketch after this entry)
2023-02-28 04:30:06 +00:00
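The vendored bisect_left mentioned in the last bullet boils down to a port of the CPython 3.10 implementation; a trimmed-down sketch:

```
def bisect_left(a, x, lo=0, hi=None, *, key=None):
    # 'key' only appeared in the stdlib version in python3.10
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        value = a[mid] if key is None else key(a[mid])
        if value < x:
            lo = mid + 1  # leftmost position where a[i] >= x
        else:
            hi = mid
    return lo
```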
Dima Gerasimov
6dc5e7575f vk_messages_backup: add unique_everseen to prevent duplicate messages 2023-02-28 03:55:44 +00:00
Dima Gerasimov
a7099e2efc vk_messages_backup: more correct handling of group chats & better chat ids 2023-02-28 03:55:44 +00:00
Dima Gerasimov
02c98143d5 vk_messages_backup: better structure & extract richer information 2023-02-28 03:55:44 +00:00
Dima Gerasimov
130c273513 my.telegram.telegram_backup enhancements
- add chat handle
- add permalink
- more precise types
2023-02-21 02:44:18 +00:00
Dima Gerasimov
07e7c62d02 general/ci: mypy check tests 2023-02-21 00:20:58 +00:00
Dima Gerasimov
c63177e186 general/ci: clean up mypy-misc pipeline, only exclude specific files instead
marked some module configs which aren't really ready for public use as type: ignore
2023-02-21 00:20:58 +00:00
Dima Gerasimov
eff9c02886 my.fbmessenger.android: add optional facebook_id 2023-02-20 20:14:15 +00:00
Dima Gerasimov
af874d2d75 my.fbmessenger.android: minor refactoring, comments & error handling 2023-02-20 20:14:15 +00:00
Dima Gerasimov
6493859ba5 my.telegram: initial module from telegram_backup 2023-02-19 01:20:38 +00:00
Dima Gerasimov
6594ad24dc my.tinder.android: speedup unique_everseen by adding unsafe_hash 2023-02-19 01:20:38 +00:00
Dima Gerasimov
458633ea96 my.tinder.android: add a bit of logging 2023-02-19 01:20:38 +00:00
Dima Gerasimov
0e884fe166 core/modules: switch away from using override_config to tmp_config in some tests & fake data generators 2023-02-09 02:35:09 +00:00
Dima Gerasimov
5ac5636e7f core: better support for ad-hoc configs
properly reload/unload the relevant modules so hopefully no more weird hacks should be required

relevant
- https://github.com/karlicoss/promnesia/issues/340
- https://github.com/karlicoss/HPI/issues/46
2023-02-09 02:35:09 +00:00
Dima Gerasimov
fb0c1289f0 my.fbmessenger.export: use context manager to properly close sqlite connection 2023-02-08 02:18:00 +00:00
Dima Gerasimov
bb5ad2b6ac core: make hpi install more defensive, just warn on no requirements
this is useful for backwards compatibility if modules remove their requirements
2023-02-07 01:57:00 +00:00
Dima Gerasimov
5c82d0faa9 switch from using dataset to raw sqlite3 module
dataset is kinda unmaintained and currently broken due to sqlalchemy 2.0 changes

resolves https://github.com/karlicoss/HPI/issues/264
2023-02-07 01:57:00 +00:00
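What the switch above amounts to in practice -- raw sqlite3 instead of dataset's ORM-ish wrapper (file and table names here are illustrative):

```
import sqlite3
from contextlib import closing

# open read-only so the export file is never accidentally modified
with closing(sqlite3.connect('file:export.db?mode=ro', uri=True)) as conn:
    rows = conn.execute('SELECT * FROM messages ORDER BY date').fetchall()
```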
Dima Gerasimov
9c432027b5 instagram.android: fix missing id 2023-02-07 01:57:00 +00:00
karlicoss
11b6e51c90 ci: fix tox config
seems that after version 4.0 it's necessary to specify environments to run
previously it was picking them up automatically
2023-01-31 00:31:54 +00:00
Sean Breckenridge
54e6fe6ab5 ci: try disabling parallel pip installs on windows 2022-12-17 21:07:30 +00:00
Sean Breckenridge
ad52e131a0 google.takeout.parser: recreate cache on upgrade
https://github.com/seanbreckenridge/google_takeout_parser/pull/37
2022-12-17 21:07:30 +00:00
Sean Breckenridge
716a2c82ba core/serialize: serialize stdlib Decimal class 2022-10-19 00:07:30 +01:00
Dima Gerasimov
7098d6831f fix mypy in _identity
seems easier to just ignore considering it's an "internal" function

also a couple of tests to make sure it infers types correctly
2022-10-19 00:06:23 +01:00
Dima Gerasimov
5f1d41fa52 my.twitter.archive: fix for newer format (tweets filename changed to tweets.js) 2022-10-19 00:06:23 +01:00
Dima Gerasimov
ca91be8154 twitter.archive: fix legacy config detection
apparently .name contains the parent module so previously it was throwing the exception instead
2022-10-19 00:06:23 +01:00
Dima Gerasimov
c8cf0272f9 instagram.gdpr: use new path to personal information 2022-10-19 00:06:23 +01:00
Sean Breckenridge
7925ec81b6 docs: browser - fix examples for config 2022-08-29 00:03:32 +01:00
Dima Gerasimov
119b295d71 core: allow legacy modules to be used in 'hpi module install' for backwards compatibility
but show warning

kinda hacky, but hopefully we will simplify it further when we have more such legacy modules
2022-06-07 22:59:08 +01:00
Sean Breckenridge
dbd15a7ee8 source: propagate help url for config errors 2022-06-07 21:33:38 +01:00
Dima Gerasimov
cef9b4c6d3 ci: try using --parallel install for mypy pipeline
`time tox -e mypy-misc` (removed the actual mypy call)

before (each module in a separate 'hpi install' command)
```
real	1m45.901s
user	1m19.555s
sys	0m5.491s
```

in a single 'hpi install' command (multiple modules)
```
real	1m31.252s
user	1m6.028s
sys	0m5.065s
```

single 'hpi install' command with --parallel
```
real	0m15.674s
user	0m50.986s
sys	0m3.249s
```
2022-06-06 09:49:15 +01:00
Dima Gerasimov
f0397b00ff core/main: experimental --parallel flag for hpi module install 2022-06-06 09:49:15 +01:00
Dima Gerasimov
5f0231c5ee core/main: allow passing multiple packages to 'module install'/'module requires' subcommands 2022-06-06 09:49:15 +01:00
Dima Gerasimov
016f28250b general: initial flake8 checks (for now manual)
fix fairly uncontroversial stuff in my.core like
- line spacing, which isn't too annoying (e.g. unlike many inline whitespace checks that break vertical formatting)
- unused imports/variables
- too broad except
2022-06-05 22:28:38 +01:00
Dima Gerasimov
fd0c65d176 my.tinder: initial module for android databases 2022-06-04 17:16:28 +01:00
Dima Gerasimov
b9d788efd0 some enhancements for facebook/instagram modules
figured out that datetimes are naive
better username handling + investigation of thread names
2022-06-04 17:16:28 +01:00
Sean Breckenridge
7323e99504 zulip: add stats function 2022-06-04 10:04:33 +01:00
Dima Gerasimov
b5f266c2bd my.instagram: add initial all.py + some experiments on nicer errors 2022-06-03 23:49:27 +01:00
Dima Gerasimov
bf3dd6e931 core/sqlite: experiment at typing SELECT query (to some extent)
ideally would be cool to use TypedDict here somehow, but perhaps it'd only be possible after variadic generics https://peps.python.org/pep-0646
2022-06-03 23:49:27 +01:00
Dima Gerasimov
7a1b7b1554 core/general: add assert_never + typing annotations for dataset 2022-06-03 23:49:27 +01:00
Dima Gerasimov
fd1a683d49 my.bumble: merge from all previous android exports 2022-06-02 14:21:21 +01:00
Dima Gerasimov
b96c9f4534 fbmessenger: use both id and timestamp for merging 2022-06-02 14:21:21 +01:00
Dima Gerasimov
3faebdd629 core: add Protocol/TypedDict to compat 2022-06-02 14:21:21 +01:00
Dima Gerasimov
186f561018 core: some cleanup for core/init and doctor; fix issue with compileall 2022-06-02 14:21:21 +01:00
Dima Gerasimov
9461df6aa5 general: extract the hack to warn of legacy imports and fallback to core/legacy.py
use it both in my.fbmessenger and my.reddit

if in the future any new modules need to be switched to namespace package structure with all.py it should make it easy to do

related:
- https://github.com/karlicoss/HPI/issues/12
- https://github.com/karlicoss/HPI/issues/89
- https://github.com/karlicoss/HPI/issues/102
2022-06-01 23:27:34 +01:00
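A deliberately simplified sketch of the legacy-import fallback idea (the real helper in core/legacy.py inspects how the package was imported and only warns on genuinely legacy imports; this unconditional version just shows the re-export):

```
# my/fbmessenger/__init__.py -- illustrative only, not HPI's exact code
import warnings

warnings.warn(
    "importing 'my.fbmessenger' directly is deprecated, "
    "use 'my.fbmessenger.export' instead",
    DeprecationWarning,
)

# fall back to re-exporting the legacy submodule so old code keeps working
from .export import *  # noqa: F401,F403
```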
Dima Gerasimov
8336d18434 general: add an adhoc test for checking mixin behaviour with namespace packages and __init__.py hack
also use that hack in my.fbmessenger
2022-06-01 23:27:34 +01:00
Dima Gerasimov
179b657eea general: add a test for __init__.py fallback for modules which are switching to namespace packages
for now just a manual ad-hoc test, will try to set it up on CI later

relevant to the discussion here: https://memex.zulipchat.com/#narrow/stream/279601-hpi/topic/extending.20HPI/near/270465792

also potentially relevant to

- https://github.com/karlicoss/HPI/issues/89 (will try to apply to this to reddit/__init__.py later)
- https://github.com/karlicoss/HPI/issues/102
2022-06-01 23:27:34 +01:00
Dima Gerasimov
049820c827 my.github.gdpr: support uncompressed .tar.gz files
related to https://github.com/karlicoss/HPI/issues/20
2022-05-31 22:16:05 +01:00
Dima Gerasimov
1b4ca6ad1b github.gdpr: prepare for using .tar.gz 2022-05-31 22:16:05 +01:00
Dima Gerasimov
73e57b52d1 general: cleanup -- remove main and executable bit where it's not necessary 2022-05-31 22:16:05 +01:00
Dima Gerasimov
2025d7ad1a general: minor cleanup
- get rid of unnecessary globs in get_files (they should be in config if the user wishes)
- get rid of some old kython imports
- do not convert Path twice in foursquare (so CPath works correctly)
2022-05-31 22:16:05 +01:00
Dima Gerasimov
5799c062a5 my.zulip.organization: use tarfile instead of kopen/kompress
potentially will extract some common interface here like ZipPath

relevant to https://github.com/karlicoss/HPI/issues/20
2022-05-31 14:08:50 +01:00
Dima Gerasimov
4e59a65f9a core/general: move cached_property into compat, use standard implementation from python3.8 2022-05-31 14:08:50 +01:00
Dima Gerasimov
711157e0f5 my.twitter.archive: switch to zippath, add config section, better mypy coverage 2022-05-31 14:08:50 +01:00
Dima Gerasimov
d092608002 twitter.talon: make retweets more compatible with twitter archive 2022-05-31 01:28:11 +01:00
Dima Gerasimov
ef120bc643 twitter.talon: expand URLs 2022-05-31 01:28:11 +01:00
Dima Gerasimov
946daf40d0 twitter: prefer archive data over twidump for tweets
also add a script to check twitter data
2022-05-31 01:28:11 +01:00
Dima Gerasimov
bb4c77612b twitter.twint: fix missing mentions in tweet text 2022-05-31 01:28:11 +01:00
Dima Gerasimov
bb6201bf2d my.twitter.archive: expand entities in tweet text 2022-05-31 01:28:11 +01:00
Dima Gerasimov
1e2fc3bec7 twitter.archive: unescape stuff like &lt/&gt 2022-05-31 01:28:11 +01:00
Dima Gerasimov
44a6b17ec3 twitter: use created_at as an extra key for merging 2022-05-31 01:28:11 +01:00
Dima Gerasimov
4104f821fa twitter.twint: actually need to treat created_at as UTC 2022-05-31 01:28:11 +01:00
Dima Gerasimov
d65e1b5245 twitter.twint: localize timestamps correctly
same issue as discussed here https://memex.zulipchat.com/#narrow/stream/279610-data/topic/google.20takeout.20timestamps

also see corresponding changes for google_takeout_parser

- https://github.com/seanbreckenridge/google_takeout_parser/pull/28/files
- https://github.com/seanbreckenridge/google_takeout_parser/pull/30/files
2022-05-31 01:28:11 +01:00
Dima Gerasimov
de7972be05 twitter: add permalink to Talon objects; extract shared method 2022-05-31 01:28:11 +01:00
Sean Breckenridge
19da373a0a location: remove duplicate via_ip import 2022-05-27 22:48:14 +01:00
Dima Gerasimov
eae0e1a614 my.time.tz.via_location: provide default (empty) config if user doesn't have time config defined 2022-05-22 16:12:44 +01:00
karlicoss
76a497f2bb
general,ci: fix python 3.10 issues, add to CI (#242) 2022-05-03 19:11:23 +01:00
Dima Gerasimov
64a4782f0e core/ci: fix windows-specific issues
- use portable separators
- paths should be prefixed with r'...' (so backslashes aren't treated as escape sequences)
- sqlite connections should be closed (otherwise windows fails to remove the underlying db file)
- workaround for emojis via PYTHONUTF8=1 test for now
- make ZipPath portable
- properly use tox python environment everywhere

  this was causing issues on Windows
  e.g.
      WARNING: test command found but not installed in testenv
        cmd: C:\hostedtoolcache\windows\Python\3.9.12\x64\python3.EXE
2022-05-03 10:16:01 +01:00
Dima Gerasimov
637982a5ba ci: update ci configs
- add windows runner
- update actions versions
- other minor enhancements
2022-05-03 10:16:01 +01:00
Maxim Efremov
80c5be7293 Adding bots file type to reduce parsing issues 2022-05-02 08:53:46 +01:00
seanbreckenridge
0ce44bf0d1
doctor: better quick option propagation for stats (#239)
doctor: better quick option propagation for stats

* use contextmanager for quick stats instead of editing global state
  directly
* send quick to lots of stat related functions, so they
could possibly be used without doctor, if someone wanted to
* if a stats function has a 'quick' kwarg, send the value
there as well
* add an option to sort locations in my.time.tz.via_location
2022-05-02 00:13:05 +01:00
Sean Breckenridge
f43eedd52a docs: describe the all.py/import_source pattern 2022-04-27 07:57:16 +01:00
seanbreckenridge
2cb836181b
location: add all.py, using takeout/gpslogger/ip (#237)
* location: add all.py, using takeout/gpslogger/ip, update docs
2022-04-26 21:11:35 +01:00
Sean Breckenridge
66a00c6ada docs: add docs for google_takeout_parser 2022-04-25 02:52:34 +01:00
Dima Gerasimov
78f6ae96d1 my.youtube: use new my.google.takeout.parser module for its data
- fallback on the old logic if google_takeout_parser isn't available
- move to my.youtube.takeout (possibly mixing in other sources later)
- keep my.media.youtube, but issue deprecation warning
  currently used in orger etc, so doesn't hurt to keep
- also fixes https://github.com/karlicoss/HPI/issues/113
2022-04-20 22:22:30 +01:00
Dima Gerasimov
915cfe69b3 kompress.ZipPath: support stat().st_mtime 2022-04-19 21:08:06 +01:00
Dima Gerasimov
f9f73dda24 my.google.takeout.parser: new takeout parser, using https://github.com/seanbreckenridge/google_takeout_parser
adapted from https://github.com/seanbreckenridge/HPI/blob/master/my/google_takeout.py

additions:
- pass my.core.time.user_forced() to google_takeout_parser
  without it, BST gets weird results for me, e.g. US/Aleutian
- support ZipPath via a config switch
- flexible error handling via a config switch
2022-04-16 08:31:40 +01:00
Dima Gerasimov
6e921627d3 compat: workaround for Literal to work in runtime in python<3.8
previously it would crash with:
   SyntaxError: Forward reference must be an expression -- got 'yield'

(reproducible via python3 -c 'from typing import Union; Union[int, "yield"]' )
2022-04-16 08:31:40 +01:00
Dima Gerasimov
382f205429 my.body.sleep: fix issue with attaching temperature
seems that the index operator only works when boundaries are in the dataframe
2022-04-15 19:15:04 +01:00
Dima Gerasimov
599a8b0dd7 ZipPath: support hash, iterdir and proper / operator 2022-04-15 14:24:01 +01:00
Dima Gerasimov
e6e948de9c stackexchange.gdpr: use ZipPath instead of ad-hoc kopen 2022-04-15 12:36:11 +01:00
Dima Gerasimov
706ec03a3f instagram.gdpr: use ZipPath instead of adhoc zipfile methods
this allows using the module regardless of whether the gdpr archive is packed or unpacked
2022-04-15 12:36:11 +01:00
karlicoss
7c0f304f94
core: add ZipPath encapsulating compressed zip files (#227)
* core: add ZipPath encapsulating compressed zip files

this way you don't have to unpack it first and can work as if it's a 'virtual' directory

related: https://github.com/karlicoss/HPI/issues/20
2022-04-14 10:06:13 +01:00
Sean Breckenridge
444ec1c450 core/source: make help URL configurable 2022-04-10 16:51:15 +01:00
Sean Breckenridge
16c777b45a my.config: catch possible nested config errors 2022-04-10 16:51:15 +01:00
Dima Gerasimov
e750666e30 my.bluemaestro: workaround for weird sensor glitch 2022-03-15 00:00:31 +00:00
Sean Breckenridge
3fd6c81511 pass args to wrapped function 2022-03-15 00:00:12 +00:00
Sean Breckenridge
07b0c0cbef core/source: fix error message, force kwargs for decorator 2022-03-15 00:00:12 +00:00
Sean Breckenridge
6185942f78 core/cli: autocomplete module names 2022-02-23 12:15:00 -01:00
Sean Breckenridge
8b01674fed core/cli: add completion for hpi command 2022-02-16 08:42:26 +00:00
Sean Breckenridge
f1b18beef7 core/structure: use logger, warn leftover files 2022-02-14 19:34:03 +00:00
seanbreckenridge
9e5cd60ff2
browser: parse browser history using browserexport (#216)
* browser: parse browser history using browserexport

from seanbreckenridge/HPI module:
1fba8ccf2f/my/browser/export.py
2022-02-13 23:56:05 +00:00
Sean Breckenridge
059c4ae791 docs: add link to template 2022-02-11 09:33:03 +00:00
Sean Breckenridge
a791b25650 core/cli: add --debug flag, add HPI_LOGS to docs 2022-02-11 09:31:10 +00:00
seanbreckenridge
7bf316eb9a
core/source: use import error (#211)
core/source: use import error

uses the more broad ImportError
instead of ModuleNotFoundError

reasoning being if some submodule
(the one I'm configuring currently is
my.twitter.twint) doesn't have additional
imports from another parser/DAL, but it
still has a config block, the user would
have to create a stub-config block in their
config to use the all.py file
2022-02-10 08:57:52 +00:00
seanbreckenridge
bea2c6a201
core/structure: add partial matching (#212)
* core/structure: add partial matching
2022-02-10 08:49:13 +00:00
Sean Breckenridge
62832a6756 twitter/archive: set default logger to warning 2022-02-09 23:18:24 +00:00
Sean Breckenridge
b6fa26b899 twitter/archive: update deprecated imports 2022-02-09 23:18:24 +00:00
Dima Gerasimov
b9852f45cf twitter: use import_source and proper merging for tweets from different sources
+ use proper datetime_aware for created_at
2022-02-08 20:45:10 +00:00
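A sketch of the all.py pattern referenced above; the import_source decorator matches the helper introduced for my.reddit further down this log, but treat the exact details (e.g. the module_name kwarg) as illustrative:

```
from itertools import chain

from my.core.source import import_source

src_archive = import_source(module_name='my.twitter.archive')
src_talon = import_source(module_name='my.twitter.talon')

@src_archive
def _tweets_archive():
    from my.twitter.archive import tweets
    return tweets()

@src_talon
def _tweets_talon():
    from my.twitter.talon import tweets
    return tweets()

def tweets():
    # the real module deduplicates while merging; chaining is the minimal version
    yield from chain(_tweets_archive(), _tweets_talon())
```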
Dima Gerasimov
afdf9d4334 twitter: initial talon module, processing data from Talon android app 2022-02-08 20:45:10 +00:00
Dima Gerasimov
f8e73134b3 fbmessenger: add all.py, merge messages from different sources
followup for https://github.com/karlicoss/HPI/pull/179
2022-02-08 19:21:44 +00:00
Dima Gerasimov
4626c1bba6 fbmessenger: support config migration for fbmessengerexport source
for now kinda copied from reddit... still thinking about a more generic way
2022-02-05 14:49:12 +00:00
Dima Gerasimov
403ca9c111 fbmessenger: process Android app data
for now, no merging, will figure it out later
2022-02-05 14:49:12 +00:00
Dima Gerasimov
fcd7ca6480 fbmessenger: only import from .export in legacy mode 2022-02-05 14:49:12 +00:00
Dima Gerasimov
f78b12f005 ci: fix pytest.warns type error
use warnings.catch_warnings to suppress instead
https://docs.pytest.org/en/7.0.x/how-to/capture-warnings.html?highlight=warnings#additional-use-cases-of-warnings-in-tests

likely due to pytest update to version 7
2022-02-04 23:38:50 +00:00
Dima Gerasimov
c4ad84ad95 move materialistic module inside hackernews package
followup for https://github.com/seanbreckenridge/HPI/pull/18
2022-02-04 23:38:50 +00:00
Dima Gerasimov
590e09f80b hackernews: add initial dogsheep database importer 2022-02-04 23:38:50 +00:00
Dima Gerasimov
1e635502a2 instagram: initial module for GDPR export
still somewhat WIP, unclear how to correlate it with android data
2022-02-04 00:18:33 +00:00
Dima Gerasimov
0e891a267f doctor: suggest config documentation in case of ImportError from config
doesn't help in all cases but perhaps helpful anyway

relevant: https://github.com/karlicoss/HPI/issues/109
2022-02-02 23:46:46 +00:00
Dima Gerasimov
d1f791dee8 my.fbmessenger: move fbmessenger.py into fbmessenger/export.py
keeping it backwards compatible + conditional warning similar to https://github.com/karlicoss/HPI/pull/179

follow up for https://github.com/seanbreckenridge/HPI/pull/18
for now without the __path__ hacking, will do it in bulk later

too lazy to run test_import_warnings.sh on CI for now, but figured I'd commit it for the reference anyway
2022-02-02 23:22:45 +00:00
Dima Gerasimov
e30953195c instagram: initial module for android app data (direct messages) 2022-02-02 21:50:43 +00:00
Sean Breckenridge
823668ca5c make reddit.rexport logs info by default
can always be configured with HPI_LOGS
having this on debug makes hpi doctor
quite verbose
2022-02-02 00:35:54 +00:00
Dima Gerasimov
7ead8eb4c9 bumble: add initial module for android database 2022-01-30 23:56:24 +00:00
Dima Gerasimov
673ee53a49 my.zulip: add message permalink 2022-01-30 23:33:05 +00:00
Dima Gerasimov
a39b5605ae my.zulip: extract Server/Sender objects, experiment with normalised and denormalised objects 2022-01-30 23:33:05 +00:00
Dima Gerasimov
a1f03f9c02 my.zulip: initial zulip module, parsing full public organization export archive 2022-01-27 22:58:33 +00:00
Dima Gerasimov
73c9e46c4c core: better support for compressed stuff, add .tar.gz 2022-01-27 22:58:33 +00:00
Sean Breckenridge
7493770d4d core: remove vendorized py37 isoformat code 2022-01-27 19:25:42 +00:00
Sean Breckenridge
03dd1271f4 cli/query: add short flags, stream affects pprint
adds some short flags as CLI flags for convenience
the --stream flag previously only affected json, but
I can imagine '-o pprint -s -l 5' (print the first
5 items from some function) could be useful as well
2022-01-27 08:50:57 +00:00
Sean Breckenridge
3f4fb64d56
core: drop py36 support, update docs for reddit (#193)
* docs: update references to my.reddit
* ci: remove 3.6, add 3.9
2022-01-27 08:26:15 +00:00
Dima Gerasimov
be21606075 my.reddit: better handling for legacy reddit config
prior to this change it would error with

    @dataclass
>   class pushshift_config(uconfig.pushshift):
E   AttributeError: type object 'test_config' has no attribute 'pushshift'
2021-12-24 18:02:37 +00:00
Dima Gerasimov
5e9cc2a6a0 my.reddit: enable CI tests 2021-12-24 18:02:37 +00:00
Sean Breckenridge
01dfbbd58e use default for getattr instead of catching error 2021-12-19 19:33:31 +00:00
Sean Breckenridge
83725e49dd cli/query: allow querying dynamic functions 2021-12-19 19:33:31 +00:00
Dima Gerasimov
dd928964e6 general: fix mypy errors after mypy and pytz stubs updates
see 968fd6d01d/stubs/pytz/pytz/tzinfo.pyi (L6)
it says all concrete instances should not be None
2021-12-19 18:53:29 +00:00
Dima Gerasimov
9578b13fca my.pdf: handle update to pdfannots 0.2
undoes f5b47dd695, tests work properly now

resolves https://github.com/karlicoss/HPI/issues/180
2021-12-19 18:53:29 +00:00
Sean Breckenridge
074b8685d6 reddit: pass logger to cachew
so that HPI_LOGS can be used to interact
with this module, to check if cachew
is working properly
2021-12-19 18:25:50 +00:00
Sean Breckenridge
8033b5cdbd docs/reddit: fix code block 2021-12-19 17:20:12 +00:00
Sean Breckenridge
4364484192 docs: add hpi query example, links to other repos
also updated the MODULE_DESIGN docs to mention the
current workaround for converting single file
modules to namespace packages through the
deprecation warning
2021-12-19 17:20:12 +00:00
Sean Breckenridge
d006339ab4 reddit: fix spelling mistakes 2021-11-03 20:18:10 +00:00
Sean Breckenridge
d6c484f321 reddit: ensure rexport isnt pointing to repo 2021-10-31 21:47:10 +00:00
Sean Breckenridge
5d2eadbcc6
reddit: swap inheritance order for Protocol (#183) 2021-10-31 21:24:16 +00:00
Sean Breckenridge
8422c6e420
my.reddit: refactor into module that supports pushshift/gdpr (#179)
* initial pushshift/rexport merge implementation, using id for merging
* smarter module deprecation warning using regex
* add `RedditBase` from promnesia
* `import_source` helper for gracefully handling mixin data sources
2021-10-31 20:39:04 +00:00
Dima Gerasimov
b54ec0d7f1 ci: fix minor mypy complaints from gitpython 2021-10-29 01:41:44 +01:00
Dima Gerasimov
f5b47dd695 ci: temporary suppress pdfs tests so we can pass CI
see https://github.com/karlicoss/HPI/issues/180
2021-10-29 01:41:44 +01:00
Dima Gerasimov
68d77981db ci: update python stuff, exclude 3.6 from osx 2021-10-29 01:41:44 +01:00
Sean Breckenridge
4a04c09f31 docs: fix copy-paste errors/spelling mistakes 2021-07-10 10:56:23 +01:00
Sean Breckenridge
46198a6447
my.core.serialize: simplejson support, more types (#176)
* my.core.serialize: simplejson support, more types

I added a couple extra checks to the default function,
serializing datetime, dates and dataclasses (in case
orjson isn't installed)

(copied from below)

if orjson couldn't be imported, try simplejson
This is included for compatibility reasons because orjson
is rust-based and compiling on rarer architectures may not work
out of the box

as an example, I've been having issues getting it to install
on my phone (termux/android)

unlike the builtin JSON module which serializes NamedTuples as lists
(even if you provide a default function), simplejson correctly
serializes namedtuples to dictionaries

this just gives another option to people, simplejson is pure python
so no one should have issues with that. orjson is still way faster,
so still preferable if it's easy and there's a precompiled build
for your architecture (which there typically is)

If you're ever running this with simplejson installed and not orjson,
it's pretty easy to tell as the JSON styling is different; orjson has
no spaces between tokens, simplejson puts spaces between tokens, e.g.

simplejson: {"a": 5, "b": 10}
orjson: {"a":5,"b":10}
2021-07-08 23:02:56 +01:00
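A minimal sketch of the fallback chain described above (not the exact HPI code; note orjson returns bytes, hence the decode):

    try:
        import orjson
        def dumps(obj) -> str:
            return orjson.dumps(obj).decode('utf8')
    except ModuleNotFoundError:
        try:
            import simplejson
            def dumps(obj) -> str:
                # unlike the builtin json, simplejson turns NamedTuples into dicts
                return simplejson.dumps(obj, namedtuple_as_object=True)
        except ModuleNotFoundError:
            import json
            def dumps(obj) -> str:
                return json.dumps(obj)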
Sean Breckenridge
821bc08a23
core/structure: help locate/extract gdpr exports (#175)
* core/structure: help locate/extract gdpr exports

* ci: add install-types to install stub packages
2021-07-08 00:44:55 +01:00
Dima Gerasimov
8ca88bde2e polar: backward compatibility for my.reading.polar 2021-05-29 13:26:01 +01:00
Dima Gerasimov
2a4bddea79 polar: move to top level, add page support 2021-05-29 13:26:01 +01:00
Sean Breckenridge
e8be20dcb5
core: add tmp_dir for global access to a tmp dir (#173)
* core: add tmp_dir for global access to a tmp dir
2021-05-17 00:28:26 +01:00
Sean Breckenridge
b64a11cc69
smscalls: allow multiple backup dirs (#172)
* smscalls: allow multiple backup dirs
* add smscalls to my.config, add test to CI
2021-05-14 01:35:36 +01:00
Sean Breckenridge
014494059d smscalls: add REQUIRES block to install lxml 2021-05-10 19:51:20 +01:00
Sean Breckenridge
43cfb2742f
cli/query: bugfix, convert output to list (#170)
* cli/query: bugfix, convert output to list to keep it backwards compatible
2021-04-28 21:19:49 +01:00
Sean Breckenridge
fa7474c087 cli/query: add --stream flag
allows you to do something like

hpi query --stream my.reddit.comments
to stream the JSON objects one per line, makes
it nicer to pipe into 'jq'/'fzf' instead
of having to process the giant list
at the end
2021-04-28 18:23:16 +01:00
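e.g. (the jq filter here is hypothetical, it depends on the shape of your data):

    hpi query --stream my.reddit.comments | jq -r '.text' | head -n 5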
Sean Breckenridge
d71383ddee stats/is_data_provider: ignore 'inputs' func 2021-04-28 18:00:49 +01:00
Dima Gerasimov
68019c80db core/influx: reuse _locate_functions_or_prompt to choose the data provider 2021-04-27 20:10:10 +01:00
Dima Gerasimov
0517f7ffb8 core/influxdb: add main method to create influx measurement and fill with values
allows running something like

    python3 -m my.core.influxdb populate my.zotero
2021-04-27 20:10:10 +01:00
Sean Breckenridge
0278f2b68d cli/query: improve fallback behaviour/error msg 2021-04-24 06:15:59 +01:00
Dima Gerasimov
491bef83bc bluemaestro: make defensive, yield Exception for measurements 2021-04-22 11:11:39 +01:00
Dima Gerasimov
2611e237a3 my.orgmode: add stat function 2021-04-22 11:11:39 +01:00
Dima Gerasimov
393ed0d9ce core: set _max_workers for dummy concurrent pool 2021-04-22 11:11:39 +01:00
Sean Breckenridge
4b4cb7cb5b cli/query: bugfix where datetime was ignored 2021-04-19 20:21:17 +01:00
Sean Breckenridge
277f0e3988
cli/query: interactive fallback, improve guess_stats (#163) 2021-04-19 18:57:42 +01:00
Dima Gerasimov
91eed15a75 my.zotero: extract top level item's tags 2021-04-13 18:05:49 +01:00
Dima Gerasimov
68d3385468 my.zotero: handle colors & extract human readable 2021-04-13 18:05:49 +01:00
Dima Gerasimov
1ef2c5619e my.zotero: initial version 2021-04-13 18:05:49 +01:00
Sean Breckenridge
c1b70cd90e
docs: add some documentation on module design (#160) 2021-04-11 16:53:43 +01:00
Sean Breckenridge
f559e7cb89 my.coding.commits: fix misspelling/add warning 2021-04-07 19:59:27 +01:00
Sean Breckenridge
fb49243005
core: add hpi query command (#157)
- restructure query code for cli, some test fixes
- initial query_range implementation

    refactored functions in query some more
    to allow re-use in query_range; select()
    pretty much just calls out to a bunch
    of handlers now
2021-04-06 17:19:58 +01:00
Dima Gerasimov
b94120deaf core/sqlite: add compat version for backup() for python3.6 2021-04-05 08:37:07 +01:00
Dima Gerasimov
f09ca17560 core/sqlite: move tests to separate module, pickling during Pool.submit can't handle importing :( 2021-04-05 08:37:07 +01:00
Dima Gerasimov
e99e8725b1 core/sqlite: add a helper to do an in-memory snapshot of the db including WAL
+ add a bunch of tests for different WAL behaviours
2021-04-05 08:37:07 +01:00
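The core of it is just the stdlib backup API; a rough sketch (not the exact helper -- and sqlite3.Connection.backup needs python3.7+, hence the compat commit above):

    import sqlite3

    def inmemory_snapshot(db: str) -> sqlite3.Connection:
        # open the source read-only so the original db is never mutated
        src = sqlite3.connect(f'file:{db}?mode=ro', uri=True)
        dest = sqlite3.connect(':memory:')
        src.backup(dest)  # copies committed state, including pages still in the WAL
        src.close()
        return dest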
Dima Gerasimov
f2a339f755 core/sqlite: extract immutable connection helper
use in bluemaestro/zotero modules
2021-04-05 08:37:07 +01:00
Sean Breckenridge
349ab78fca
core/cli: switch to using click library #155
everything is backwards-compatible with the previous
interface, the only minor changes were to the doctor cmd
which can now accept more than one item to run,
and a --skip-config-check flag to skip the config_ok
check if the user asks to

added a test using click.testing.CliRunner (tests
the CLI in an isolated environment), though
additional tests which aren't testing the CLI
itself (parsing arguments or decorator behaviour)
can just call the functions themselves, as they
no longer accept an argparse.Namespace and instead
accept the direct arguments
2021-04-04 10:06:59 +01:00
Dima Gerasimov
5ef2775265 my.github: some work in progress on generating consistent ids
sadly it seems that there are several issues:

- gdpr has less detailed data so it's hard to generate a proper ID at times
- sometimes there is a small (1s?) discrepancy in created_at between the same event in GDPR and API
- some API events can have duplicate payload, but different id, which violates uniqueness
2021-04-02 20:09:53 +01:00
Dima Gerasimov
386234970b my.github.ghexport: handle more event types, more consistent body handling 2021-04-02 20:09:53 +01:00
Dima Gerasimov
b306ccc839 core: add ensure_unique iterator transformation 2021-04-02 20:09:53 +01:00
Sean Breckenridge
c31641b4eb
core: discovery_pure; allow multiple package roots (#152)
* core: discovery_pure; allow multiple package roots

iterates over my.__path__._path if possible
to discover additional paths to iterate over

else defaults to the path relative to
the current file
2021-04-02 15:46:18 +01:00
Sean Breckenridge
5ecd4b4810 cleanup; remove unused imports 2021-04-02 08:38:06 +01:00
Sean Breckenridge
a11a3af597 commits: reduce possibility of path conflicts 2021-04-02 07:20:35 +01:00
Dima Gerasimov
edf6e5d50b my.pdfs: rely on pdfannots for created date extraction/parsing 2021-04-01 17:27:06 +01:00
Dima Gerasimov
ad177a1ccd my.pdfs: cleanup/refactor
- modernize:
  - add REQUIRES spec for pdfannots library
  - config dataclass/config stub
  - stats function
  - absolute my.core imports in anticipation of splitting core
- use 'paths' instead of 'roots' (better reflects the semantics), use get_files
  backward compatible via config migration
- properly run tests/mypy
2021-04-01 17:27:06 +01:00
Dima Gerasimov
e7604c188e my.pdfs: reorganize tests a bit, fix mypy 2021-04-01 17:27:06 +01:00
Dima Gerasimov
5c38872efc core: add DummyExecutor to make it easier to debug concurrent code with Pools 2021-04-01 17:27:06 +01:00
Sean Breckenridge
3118891c03
my.core.query: initial implementation (#143)
in particular `my.core.query.select`: a function to query, order, sort and filter items from one or more sources
2021-03-28 07:52:50 +01:00
Sean Breckenridge
d47f3c28aa my.core.serialize: support serializing Paths 2021-03-28 06:55:03 +01:00
Sean Breckenridge
1b36bd4379 fix spelling mistakes 2021-03-28 06:53:24 +01:00
Dima Gerasimov
29384aef44 my.goodreads: cleanup, rename from my.reading.goodreads & use proper pip dependency
related:
- https://github.com/karlicoss/HPI/issues/79
- 10d8cc86a1
2021-03-26 05:06:53 +00:00
Sean Breckenridge
1cdef6f40a fix mypy errors
this fixes two distinct mypy errors

one where NamedTuple/dataclasses can't be
defined locally
https://github.com/python/mypy/issues/7281

which happens when you run mypy like
mypy -p my.core on a warm cache

the second error is the core/types.py file shadowing the
stdlib types module
2021-03-22 06:34:07 +00:00
ddrone
2bbbff0021 Update README.org 2021-03-21 15:16:21 +00:00
ddrone
8e868391fd Update README.org
*muffled Yung Lean playing in the distance*
2021-03-21 15:16:21 +00:00
Sean Breckenridge
eb26cf8633
my.core.serialize: orjson with additional default and _serialize hook (#140)
basic orjson serialize, json.dumps fallback

Lots of surrounding changes from this discussion:
0593c69056
2021-03-20 00:48:03 +00:00
Sean Breckenridge
02a9fb5e8f github.gdpr: parse project files
also fixed a typo in commit_comments
2021-03-15 12:40:22 +00:00
Dima Gerasimov
a1a24ffbc3 my.coding.commits: more cleanup
Followup of https://github.com/karlicoss/HPI/pull/132

- add REQUIRES section
- use 'commits' config section & add proper schema
- use dedicated subdirectory for cache
2021-03-15 10:33:46 +00:00
Dima Gerasimov
ec8b0e9170 my.coding.commits: actually test on CI, add config stub 2021-03-15 10:33:46 +00:00
Dima Gerasimov
8d6f691824 core: feature: guess module stats from typing annotations 2021-03-15 10:27:18 +00:00
Sean Breckenridge
4db81ca362
cleanup coding.commits (#132)
* cleanup coding.commits

remove the _things check, it's never activated
for me and seems pointless

update mechanism for finding fdfind/fd path,
send a core.warning if it fails

update mechanism to cache repos to new cachew
api (remove hashf), cache repos on a per-repo
basis
2021-03-15 03:55:28 +00:00
jon r
44b893a025 add similar projects to README 2021-03-15 03:50:26 +00:00
Sean Breckenridge
fb9426d316 smscalls: add config block
so that you don't have to infer what
to set in your hpi config based
on usage in the module
2021-03-15 03:48:42 +00:00
Dima Gerasimov
c83bfbd21c ci: enable pull_request trigger 2021-03-10 07:07:38 +00:00
Dima Gerasimov
ce157c47cc google.takeout: support another reported time format
https://github.com/karlicoss/HPI/issues/114
2021-03-08 00:40:19 +00:00
Dima Gerasimov
1fd2a9f643 core/time: more flexible support for resolving TZ abbreviation -> TZ ambiguities
addresses https://github.com/karlicoss/HPI/issues/103

for now via experimental time.tz.force_abbreviations config variable
not sure if this whole thing is doomed to be resolved properly
2021-03-08 00:40:19 +00:00
Dima Gerasimov
5ef638694e minor requirements updates 2021-03-08 00:40:19 +00:00
Dima Gerasimov
0585cc4a89 arbtt: feed data to influxdb 2021-02-25 19:56:35 +00:00
Dima Gerasimov
ca4d58e4e7 core: add helper to 'freeze' dataclasses, in order to derive a schema from the properties 2021-02-25 19:56:35 +00:00
Dima Gerasimov
86497f9b13 new: basic arbtt module 2021-02-25 19:56:35 +00:00
Dima Gerasimov
ad924ebca8 refresh readme, reflect blog post changes 2021-02-22 13:29:05 +00:00
Dima Gerasimov
20585a3130 influxdb: WIP on magic automatic interface
to run:

    python3 -c 'import my.core.influxdb as I; import my.hypothesis as H; I.magic_fill(H.highlights)'
2021-02-22 10:46:40 +00:00
Dima Gerasimov
bfec6b975f influxdb: add helper to core + use it in bluemaestro/lastfm/rescuetime 2021-02-22 10:46:40 +00:00
Dima Gerasimov
271cd7feef core/cachew: use cache_dir in mcachew if it wasn't specified by the user 2021-02-21 19:51:58 +00:00
Dima Gerasimov
3e821ca7fd my.github.ghexport: get rid of custom cache_dir 2021-02-21 19:51:58 +00:00
Dima Gerasimov
9afe1811a5 core/cachew: special handling for None in order to preserve cache_dir() path
+ add 'suffix' argument for more straightforward logic
2021-02-21 19:51:58 +00:00
Dima Gerasimov
da3c1c9b74 core/cachew: rely on ~/.cache for default cache path
- rely on appdirs for default cache path instead of hardcoded /var/tmp/cachew
  technically backwards incompatible, but no action needed
  you might want to clean /var/tmp/cachew after updating

- use default cache path (e.g. ~/.cache) by default
  see https://github.com/ActiveState/appdirs#some-example-output for more info
  *warning*: things will be cached by default now (used to be uncached before)

- treat cache_dir = None in the config
  *warning*: kind of backwards incompatible.. but again nothing disastrous
2021-02-21 19:51:58 +00:00
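For reference, the appdirs default mentioned above resolves per-OS; e.g. (app name here is illustrative):

    import appdirs
    print(appdirs.user_cache_dir('cachew'))
    # Linux: ~/.cache/cachew, macOS: ~/Library/Caches/cachew, ...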
Dima Gerasimov
837ea16dc8 add changelog 2021-02-20 02:06:53 +00:00
Dima Gerasimov
ad1cc71b0f readme: update 2021-02-20 01:39:55 +00:00
Dima Gerasimov
3b4a2a378f core: make discovery even more static, has_stats via ast + tests 2021-02-19 02:39:25 +00:00
Dima Gerasimov
f90599d7e4 core: make discovery rely on ast module more, add test 2021-02-19 02:39:25 +00:00
Dima Gerasimov
a3305677b2 core: deprecate my.cfg, instead my.config can (and should be) used directly 2021-02-19 02:39:25 +00:00
Dima Gerasimov
ddbb2e5f23 CI: better cleanup for modules in between tests 2021-02-19 02:39:25 +00:00
Dima Gerasimov
94ace823e0 tests: cleanup location/tz tests 2021-02-19 02:39:25 +00:00
Dima Gerasimov
82e2f96192 core: add test for tmp_config; unset new attributes 2021-02-19 02:39:25 +00:00
Dima Gerasimov
5313984d8f core: add tmp_config helper for test & adhoc patching
bluemaestro: cleanup tests
2021-02-19 02:39:25 +00:00
Dima Gerasimov
42399f6250 pinboard: *breaking backwards compatibility*, use pinbexport module directly
Use 'hpi module install my.pinboard' to install it

relevant: https://github.com/karlicoss/HPI/issues/79
2021-02-18 20:46:03 +00:00
Dima Gerasimov
0534c5c57d cli: add 'hpi module install' and 'hpi module requires'
ci: use hpi module install; remove explicit module links

relevant:

- https://github.com/karlicoss/HPI/issues/12
- https://github.com/karlicoss/HPI/issues/79
2021-02-18 02:04:40 +00:00
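Usage sketch (my.pinboard as in the commit above):

    hpi module requires my.pinboard   # print the module's PIP requirements
    hpi module install my.pinboard    # pip install them for you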
Dima Gerasimov
97650adf3b core: add discovery_pure module to get modules and their dependencies via ast module 2021-02-18 02:04:40 +00:00
Dima Gerasimov
4ad4f34cda core: improve mypy coverage 2021-02-18 02:04:40 +00:00
Dima Gerasimov
56d5587c20 CI: clean up tox config a bit, get rid of custom lint script 2021-02-18 02:04:40 +00:00
Dima Gerasimov
f102101b39 core/windows: fix get_files and its tests 2021-02-16 06:40:42 +00:00
Dima Gerasimov
6d9bc2964b bluemaestro: populate grafana 2021-02-15 00:15:44 +00:00
Dima Gerasimov
1899b006de bluemaestro: investigation of data quality + more sanity checks 2021-02-15 00:15:44 +00:00
Dima Gerasimov
746c3da0ca core.pandas: allow specifying schema; add tests 2021-02-15 00:15:44 +00:00
Dima Gerasimov
d77ab92d86 bluemaestro: get rid of unnecessary file, move to top level 2021-02-15 00:15:44 +00:00
Dima Gerasimov
d562f00dca tests: run all tests, but exclude tests specific to my computer from CI
controllable via HPI_TESTS_KARLICOSS=true
2021-02-14 17:47:18 +00:00
Dima Gerasimov
6239879245 core: add more tests for stat/datetime guessing 2021-02-14 16:20:38 +00:00
Dima Gerasimov
4012f9b7c2 core: more generic functions to jsonify data, rescuetime: fix influxdb filling 2021-02-14 16:20:38 +00:00
Dima Gerasimov
07f901e1e5 core: helpers for automatic dataframes from sequences of NamedTuple/dataclass
also use in my.rescuetime
2021-02-14 16:20:38 +00:00
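Roughly, the trick is to 'jsonify' each item into a dict row first; a sketch under assumed names (not HPI's actual helper, and assuming pandas is available):

    import dataclasses
    from typing import Any, Iterable
    import pandas as pd

    def as_dataframe(it: Iterable[Any]) -> pd.DataFrame:
        def to_row(x: Any) -> dict:
            if dataclasses.is_dataclass(x):
                return dataclasses.asdict(x)
            if hasattr(x, '_asdict'):  # NamedTuple duck typing
                return x._asdict()
            return dict(vars(x))
        return pd.DataFrame(map(to_row, it))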
Dima Gerasimov
df9a7f7390 core.pandas: add check for 'error' column + add empty one by default 2021-02-14 16:20:38 +00:00
Dima Gerasimov
3a1e21635a my.body.blood: use same file
overall I guess this module is highly specific to me anyway, so it's gonna be
hard to make it generic...
2021-02-14 16:20:38 +00:00
Dima Gerasimov
5b501d1562 extract combined exercise module 2021-01-11 20:42:23 +00:00
Dima Gerasimov
6b451336ed Initial parser for RunnerUp data which I'm now using instead of Endomondo 2021-01-11 20:42:23 +00:00
Dima Gerasimov
e81dddddf0 core: properly resolve class properties in make_config + add test 2020-12-13 18:29:49 +01:00
Dima Gerasimov
dda628e866 CI: fix extras_require after dependency resolver update
https://github.com/pypa/pip/issues/8940
2020-12-11 07:02:16 +01:00
Dima Gerasimov
571cb48aea core: add modules_ast for more robust module collection 2020-12-11 07:02:16 +01:00
Dima Gerasimov
63c825ab81 my.stackexchange: use GDPR data for votes 2020-12-11 07:02:16 +01:00
Dima Gerasimov
ddea816a49 my.stackexchange: use proper pip package, add stat
+ 'anonymous' mode for stat() function
2020-12-11 07:02:16 +01:00
Rosano
9d39892e75 Add remoteStorage
Similar to Solid, you can put data in a space you control at the outset. Also there is https://0data.app for apps that integrate with Solid and remoteStorage.
2020-12-09 03:06:41 +01:00
Dima Gerasimov
8abe66526d my.photos: minor fixes/configcleanup + speedup 2020-11-25 04:47:30 +01:00
Dima Gerasimov
f8db8c7b98 ci: update github CI config 2020-11-23 20:46:26 +01:00
Dima Gerasimov
29ad315578 core/cli: some enhancements for friendlier errors/default config
see https://github.com/karlicoss/HPI/issues/110
2020-11-23 20:46:26 +01:00
Dima Gerasimov
a6e5908e6d get rid of porg dependency, use orgparse directly 2020-11-06 23:02:35 +01:00
Dima Gerasimov
62e1bdc39a my.fbmessenger: use pip package
https://github.com/karlicoss/HPI/issues/79
2020-11-06 23:02:35 +01:00
Dima Gerasimov
ed4b6c409f doctor & stat improvements
- doctor: log when there is no stats() function and suggest to use it
- check that stats() result isn't None
- warn when there is no data in the iterable
2020-11-01 02:02:43 +01:00
Dima Gerasimov
2619af7ae7 doctor: warn if the default config is used 2020-11-01 02:02:43 +01:00
Dima Gerasimov
f40b804833 update setup documentation 2020-11-01 02:02:43 +01:00
Dima Gerasimov
1849a66f08 general: get rid of example_config & use demo/stub my.config instead 2020-11-01 02:02:43 +01:00
Dima Gerasimov
96be32aa51 mypy-friendly compat functions handling 2020-11-01 01:51:10 +01:00
Dima Gerasimov
3a9e3e080f my.time.tz: implement different policies for localizing 2020-11-01 01:51:10 +01:00
Dima Gerasimov
15789a4149 kython.kompress: move to core (with a fallback, used in promnesia) 2020-10-29 03:13:18 +01:00
Dima Gerasimov
655b86bb0a my.kython.konsume: move to core 2020-10-29 03:13:18 +01:00
Dima Gerasimov
cc127f1876 kython.klogging
- move to core
- add a proper description why it's useful
- make default level INFO
- use HPI_LOGS variable for easier log level control (abdc6df1ea)
2020-10-29 03:13:18 +01:00
Dima Gerasimov
a946e23dd3 core.pandas: dump the timezones in check_dateish 2020-10-21 01:29:29 +02:00
Dima Gerasimov
831fee42a1 core: minor error handling tweaks 2020-10-21 01:29:29 +02:00
Dima Gerasimov
2a2478bfa9 core: update cachew annotations
orgmode: expose method to construct cacheable note
2020-10-21 01:29:29 +02:00
Dima Gerasimov
bdfac96352 core.error: more generic sort_res_by 2020-10-21 01:29:29 +02:00
Dima Gerasimov
fa5e181cf8 core: minor helpers for error handling 2020-10-21 01:29:29 +02:00
Dima Gerasimov
d059e4aaf4 add taplog provider 2020-10-21 01:29:29 +02:00
Dima Gerasimov
ed47e98d5c my.body.sleep: integrate with optional temperature data 2020-10-12 21:48:04 +02:00
Dima Gerasimov
725597de97 add my.body.sleep, combine together emfit/jawbone 2020-10-12 21:48:04 +02:00
Dima Gerasimov
e8e4994c02 google.takeout.paths: return Optional if there are no takeouts 2020-10-12 21:48:04 +02:00
Dima Gerasimov
4666378f7e my.location.home: simplify config format, make it a bit more robust + tests 2020-10-12 09:05:11 +02:00
Dima Gerasimov
d8ed780e36 my.orgmode: cache entries 2020-10-11 18:44:37 +02:00
Dima Gerasimov
1ded99c61c my.notes.orgmode: move to my.orgmode 2020-10-11 18:44:37 +02:00
Dima Gerasimov
649537deca my.notes.orgmode: make a bit more iterative 2020-10-11 18:44:37 +02:00
Dima Gerasimov
6a1a006202 core: add DataFrame support to stat 2020-10-11 18:44:37 +02:00
Dima Gerasimov
209cffb476 doctor: print import order 2020-10-09 23:22:00 +02:00
Dima Gerasimov
96113ad5ae my.calendar.holidays: unhardcode calendar, detect it from the location data 2020-10-09 23:22:00 +02:00
Dima Gerasimov
bdb5dcd221 my.calendar.holidays: cleanup + ci/stats + split off private data handling to https://github.com/karlicoss/hpi-personal-overlay 2020-10-09 23:22:00 +02:00
Dima Gerasimov
1f9be2c236 fix after mypy version update 2020-10-09 22:09:19 +02:00
Dima Gerasimov
35b91a6fa2 timezone provider: add stat(), use cachew (daily resolution) 2020-10-09 22:09:19 +02:00
Dima Gerasimov
dfea664f57 add my.location.home, use it as location/timezone fallback 2020-10-09 22:09:19 +02:00
Dima Gerasimov
1f2e595be9 Initial my.time.tz provider, infer from location with daily resolution 2020-10-09 22:09:19 +02:00
Dima Gerasimov
dc2518b348 my.location.google: cleanup old stuff related to tagging, definitely doesn't belong to this module 2020-10-08 21:31:26 +02:00
Dima Gerasimov
ba9acc3445 my.location: let takeout provider be in a separate my.location.google; add CI test & enable mypy 2020-10-08 21:31:26 +02:00
Dima Gerasimov
90ada92110 bluemaestro: include humidity, pressure and dewpoint data 2020-10-08 21:22:02 +02:00
Dima Gerasimov
ced93e6942 reflect cachew changes of exception handling and temporary suppression 2020-10-08 21:22:02 +02:00
Dima Gerasimov
d3f2551560 core.pandas: check index in check_dataframe 2020-10-04 01:40:52 +02:00
Dima Gerasimov
5babbb44d0 my.bluemaestro: workaround weird timestamps by keeping track of the latest timestamp 2020-10-04 01:40:52 +02:00
Dima Gerasimov
8e8d9702f3 my.bluemaestro: investigation of weird timestamps 2020-10-04 01:40:52 +02:00
Dima Gerasimov
6242307d7a my.bluemaestro: run against testdata, add on CI 2020-10-04 01:40:52 +02:00
Dima Gerasimov
e63c159b80 my.body.exercise: add more annotations & ci check 2020-10-03 18:24:08 +02:00
Dima Gerasimov
06ee72bc30 core: more type annotations 2020-10-03 18:24:08 +02:00
Sean Breckenridge
44b756cc6b smscalls: use stdlib for tz, attach readable date
pytz is overkill for this, use the built-in
datetime.timezone (available since python 3.2)

attach the readable datetime
like 'Sep 12, 2020 9:12:19 AM' to each
of the calls/messages
2020-10-02 19:11:48 +02:00
Sean Breckenridge
160582b6cf parse sms messages from xml files 2020-10-02 19:11:48 +02:00
Dima Gerasimov
d8841d0d7a my.endomondo: add fake data generator, test mypy 2020-10-02 00:37:08 +02:00
Dima Gerasimov
1c20eb27aa CI: add mypy checks for my.reddit, my.pocket and my.github.ghexport 2020-09-30 23:33:06 +02:00
Dima Gerasimov
0682919449 general: use module dependencies as proper PIP packages + fallback 2020-09-30 23:33:06 +02:00
Sean Breckenridge
c68d81a8ca suppress conflicting regex warning 2020-09-30 22:58:48 +02:00
Dima Gerasimov
ed25fc2eeb cli: tabulate warnings for cleaner visual output; add --quick flag for doctor 2020-09-30 21:54:09 +02:00
Dima Gerasimov
fd41caa640 core: add __NOT_HPI_MODULE__ flag to mark utility files etc
(more of an intermediate solution perhaps)
2020-09-30 21:54:09 +02:00
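Marking a file is a one-liner, roughly (hypothetical utility file; the exact import location may differ):

    # my/coding/some_util.py -- illustrative name
    from my.core import __NOT_HPI_MODULE__  # excluded from module discovery/doctor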
Dima Gerasimov
3b9941e9ee cli: add --all for doctor/modules command 2020-09-30 21:54:09 +02:00
Dima Gerasimov
4b49add746 core: more consistent module detection logic 2020-09-30 21:54:09 +02:00
Dima Gerasimov
c79ffb50f6 core: add tests for core_config 2020-09-30 21:54:09 +02:00
Dima Gerasimov
70c801f692 core: add 'core' config section, add disabled_modules/enabled_modules configs, use them for hpi modules and hpi doctor 2020-09-30 21:54:09 +02:00
Dima Gerasimov
f939daac99 ci: upload mypy coverage artifacts 2020-09-29 20:43:34 +02:00
Dima Gerasimov
dc642b5a6d my.instapaper: add stat; add mypy checks on CI 2020-09-29 20:43:34 +02:00
Dima Gerasimov
3404b3fcf1 my.instapaper: use instapexport from PIP package 2020-09-29 20:43:34 +02:00
Dima Gerasimov
24fb983399 ci: add mypy for my.hypothesis 2020-09-29 19:44:45 +02:00
Dima Gerasimov
6199ed7916 my.hypothesis: better mypy coverage 2020-09-29 19:44:45 +02:00
Dima Gerasimov
deefa9fbbc Use hypexport package in demo.py, clean up tox 2020-09-29 19:44:45 +02:00
Dima Gerasimov
abbaa47aaf core.warnings: handle stacklevel properly
add more warnings about deprecated config arguments
2020-09-29 19:44:45 +02:00
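The stacklevel bit matters because otherwise warnings point at the helper itself; a minimal sketch of the pattern (not the actual my.core.warnings code):

    import warnings

    def high(message: str) -> None:
        # stacklevel=2 attributes the warning to the caller's line, not this wrapper
        warnings.warn(message, UserWarning, stacklevel=2)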
Dima Gerasimov
109edd9da3 general: add compat module and helper for easy backwards compatibility for pre-PIP dependencies
my.hypothesis: use hypexport as a proper PIP package + fallback
2020-09-29 19:44:45 +02:00
Dima Gerasimov
fbaa8e0b44 core: add warnings helper to highlight warnings so they are more visible in the output 2020-09-27 17:47:30 +02:00
Dima Gerasimov
cd40fc75c3 my.emfit: expose fake data contextmanager 2020-09-19 18:10:16 +01:00
Dima Gerasimov
f02c572cc0 body.exercise: add cardio summary, move cross trainer to a separate file 2020-09-19 18:10:16 +01:00
Dima Gerasimov
eb14d5988d my.body.exercise: more robust handling + handle mismatching timezones 2020-09-19 18:10:16 +01:00
Dima Gerasimov
afce09d1d4 my.body.exercise: more consistent merging for cross trainer data 2020-09-19 18:10:16 +01:00
Dima Gerasimov
1ca2d116ec my.body.exercise: cleanup & error handling for merging cross trainer stuff 2020-09-19 18:10:16 +01:00
Dima Gerasimov
0b947e7d14 my.body.exercise: port code from private workouts provider, simplify 2020-09-19 18:10:16 +01:00
Dima Gerasimov
baac593aef port endomondo data provider from my private package 2020-09-19 18:10:16 +01:00
Dima Gerasimov
28fcc1d9b6 my.rescuetime: use rescuexport directly, add error handling & dataframe 2020-09-18 23:50:40 +01:00
Dima Gerasimov
e34c04ebc8 core.cachew: make disabled_cachew defensive 2020-09-18 23:50:40 +01:00
Dima Gerasimov
ef72ac3386 core: add initial config hacking helper
rescuetime: initial fake data generator
2020-09-18 23:50:40 +01:00
Dima Gerasimov
132db1dc0c core: add pandas utils 2020-09-17 21:39:14 +01:00
Dima Gerasimov
63b848087d my.jawbone: minor cleanup & refactoring, proper error propagation 2020-09-17 21:39:14 +01:00
Dima Gerasimov
99e50f0afe core: experiments with attaching datetime information to errors
my.body.weight: put datetime information in the error rows
2020-09-09 21:37:15 +01:00
Dima Gerasimov
743312a87b my.body.blood: prettify, add stat() 2020-09-09 21:37:15 +01:00
Dima Gerasimov
efea669a3e my.location: some cleanup and speedups 2020-09-09 21:37:15 +01:00
Dima Gerasimov
65781dd152 emfit: patch up timezone for correct local sleep time 2020-09-09 21:37:15 +01:00
Dima Gerasimov
d9bbf7cbf0 emfit: propagate errors properly, expose dataframe 2020-09-09 21:37:15 +01:00
Sean Breckenridge
78489157a1 fix spelling mistakes 2020-09-06 20:44:28 +01:00
Dima Gerasimov
07dd61ca6a my.emfit: move data access layer bits to emfitexport 2020-08-20 21:30:52 +01:00
Dima Gerasimov
975f9dd110 rescuetime: get rid of kython, use cachew 2020-08-20 21:30:52 +01:00
Dima Gerasimov
6515d1430f core: experimental guessing for last objects' date 2020-08-20 21:30:52 +01:00
karlicoss
cde5502151
Merge pull request #74 from thetomcraig/pdfs-process-filelist
Add "filelist" parameter to annotated_pdfs
2020-08-20 21:08:57 +01:00
Adrien Lacquemant
5b2cc577f2 Correct command to create config file 2020-08-20 20:14:55 +01:00
Tom Craig
5dc62ff085 Add tests for pdfs 2020-08-16 13:36:36 -07:00
Tom Craig
882ceb62fc Add a "filelist" parameter to annotated_pdfs 2020-08-16 12:57:20 -07:00
Dima Gerasimov
626ee994bf twint: open database in read only mode 2020-07-31 12:22:13 +01:00
Dima Gerasimov
4920defe12 vk: add messages processing 2020-07-31 12:22:13 +01:00
Dima Gerasimov
c54d85037c core: add base cachew directory 2020-07-31 12:22:13 +01:00
Dima Gerasimov
10a8ebaae4 vk: move favorites module to a subpackage, add stat 2020-07-30 22:37:21 +01:00
Dima Gerasimov
4ee89b85ee reddit: add stats() 2020-07-30 22:37:21 +01:00
Dima Gerasimov
a9ae6dbb7f core: add error count to stats helper 2020-07-30 22:37:21 +01:00
Dima Gerasimov
92307d5f3d bluemaestro: support new databases as well 2020-07-28 20:32:35 +01:00
Dima Gerasimov
9d45eb0559 bluemaestro: make iterative, add stat() 2020-07-28 20:32:35 +01:00
Tom Craig
fdaae59b59 Add .get to call for d[date] 2020-07-27 21:33:44 +01:00
Dima Gerasimov
092aef88ce core: detect compression, wrap in CPath if necessary 2020-07-26 21:31:26 +01:00
Dima Gerasimov
77deef98de reddit: more consistent handling for events 2020-07-26 21:31:26 +01:00
Dima Gerasimov
031b1278eb reddit: cleanup cachew wrapper a bit 2020-07-26 21:31:26 +01:00
Dima Gerasimov
6b548c24c1 doctor: better mypy detection 2020-07-26 21:31:26 +01:00
Dima Gerasimov
5eecd8721d cli: check specific module with doctor; print help on no command 2020-07-06 21:40:41 +01:00
Dima Gerasimov
49d25a75ae core: use immutable mode in dataset helper 2020-07-06 21:40:41 +01:00
Dima Gerasimov
4fc33a9ed2 core: add helper for opening read-only database 2020-07-06 21:40:41 +01:00
karlicoss
0bcc5952c7
Merge pull request #62 from karlicoss/updates
updates: core & kobo
2020-06-04 22:55:15 +01:00
Dima Gerasimov
821eb47c93 kobo: BREAKING changes. Use kobuddy module directly, rename export_dir to export_path.
Hopefully this makes a lot of sense in the first place, and there aren't that many users, so it deserves breaking.
2020-06-04 22:50:52 +01:00
Dima Gerasimov
db852b3927 kobo: move away from my.books 2020-06-04 22:20:48 +01:00
Dima Gerasimov
1cc4eb5d8d core: add helper for computing stats; use it in modules 2020-06-04 22:19:34 +01:00
karlicoss
a94b64c273
Merge pull request #61 from karlicoss/updates
github module: cleanup and proper modular layout
2020-06-01 23:52:07 +01:00
Dima Gerasimov
3d7844b711 core: support '' for explicitly set empty path set 2020-06-01 23:45:26 +01:00
Dima Gerasimov
a267aeec5b github: add config templates + docs
- ghexport: use export_path (export_dir is still supported)
2020-06-01 23:33:34 +01:00
Dima Gerasimov
ca39187c63 github: DEPRECATE my.coding.github
Instead my.github.all should be used (still backward compatible)

The reasons are
a) I don't feel that grouping (i.e. my.coding.*) makes much sense
b) using .all pattern (same way as twitter) allows for more composable and cleaner separation of GDPR and API data
2020-06-01 22:49:31 +01:00
Dima Gerasimov
d7aff1be3f github: start moving to a proper arbitrated module 2020-06-01 22:49:31 +01:00
Matthew Reishus
67cf4d0c04 my.coding.github ignores some events emitted by bots.
I use a service called dependabot ( https://dependabot.com/ ).  It
automatically creates pull requests in my repositories to upgrade
dependencies.  The modern front end javascript world moves really
quickly; projects have a ton of dependencies that are updating all the
time, so there are a lot of these pull requests.

Also, the PRs it makes have a lot of info in them.  Here's an example
one: https://github.com/mreishus/spades/pull/180 .  If you hit the
arrows, you can see it includes a lot of text in "Changelog" and
"Commits".  Now check out the list of closed PRs this project has:
https://github.com/mreishus/spades/pulls?q=is%3Apr+is%3Aclosed

Once I got everything working with my.coding.github, my Github.org
(using orger) was huge: 5MB.  I wanted to get rid of the dependabot
stuff, since it's mostly junk I'm not too interested in, and I got it
down to 130K (from 5MB) just from this commit.

Here's an example of an event I'm filtering out:
I'm looking to see if the "user" contains a "[bot]" tag in it.

  {
    "type": "pull_request",
    "url": "https://github.com/mreishus/spades/pull/96",
    "user": "https://github.com/dependabot-preview[bot]",
    "repository": "https://github.com/mreishus/spades",
    "title": "Bump axios from 0.19.1 to 0.19.2 in /frontend",
    "body": "Bumps [axios](https://github.com/axios/axios) from 0.19.1 to 0.19.2.\n<details>\n<summary>Release notes</summary [cut 5000 characters]
    "base": {
      "ref": "master",
      "sha": "a47687762887c6e5c0d5d0a38c3c9697f09cbcd6",
      "user": "https://github.com/mreishus",
      "repo": "https://github.com/mreishus/spades"
    },
    "head": {
      "ref": "dependabot/npm_and_yarn/frontend/axios-0.19.2",
      "sha": "0e79d0220002cb54cd40e13a40addcc0d0a01482",
      "user": "https://github.com/mreishus",
      "repo": "https://github.com/mreishus/spades"
    },
    "assignee": "https://github.com/mreishus",
    "assignees": [
      "https://github.com/mreishus"
    ],
    "milestone": null,
    "labels": [
      "https://github.com/mreishus/spades/labels/dependencies",
      "https://github.com/mreishus/spades/labels/javascript"
    ],
    "review_requests": [

    ],
    "work_in_progress": false,
    "merged_at": null,
    "closed_at": "2020-01-25T14:40:27Z",
    "created_at": "2020-01-22T13:37:17Z"
  },

Maybe this should be a config option, but I didn't know how to make them
cleanly in HPI, and I'm not sure if anyone would ever want this stuff.
2020-06-01 16:22:07 +01:00
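The filter itself can be tiny; something along these lines (a sketch, field name taken from the payload above):

    def is_bot(event: dict) -> bool:
        # e.g. "https://github.com/dependabot-preview[bot]"
        return '[bot]' in event.get('user', '')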
Dima Gerasimov
f175acc848 pocket: reuse pockexport data access layer
BREAKING CHANGE! Data parsing was switched to pockexport.
This would help to keep it consistent across different apps in the future.

When you update, you'll need to:

- clone pockexport (latest version)
- set pockexport repository in your config (see doc/MODULES.org)
2020-05-27 08:42:47 +01:00
Dima Gerasimov
6453ff415d docs: somewhat acceptable data flow diagrams 2020-05-26 22:51:50 +01:00
Dima Gerasimov
150a6a8cb7 docs: wip on better explanation of configs/diagram 2020-05-26 22:51:50 +01:00
karlicoss
04eca6face
Merge pull request #55 from karlicoss/updates
cli updates: doctor mode
2020-05-25 12:30:18 +01:00
Dima Gerasimov
e351c8ba49 cli: add 'config init' command 2020-05-25 12:25:41 +01:00
Dima Gerasimov
7bd7cc9228 cli: integrate with stats reported by the modules 2020-05-25 11:46:30 +01:00
Dima Gerasimov
d890599c7c cli: add checks for importing modules 2020-05-25 11:41:44 +01:00
Dima Gerasimov
8019389ccb cli: move doctor to core, add doc 2020-05-25 10:17:40 +01:00
Dima Gerasimov
dab29a44b5 cli: detect config properly in mypy check 2020-05-25 10:04:58 +01:00
Dima Gerasimov
2ede5b3a5c cli: add config check command 2020-05-25 09:49:57 +01:00
karlicoss
ce8cd5b52c
Merge pull request #54 from karlicoss/updates
core: update warnings, add warn_if_empty decorator for more defensive data sources
2020-05-25 01:28:42 +01:00
Dima Gerasimov
248e48dc30 core: improve types for warn_if_empty
ok, works with this advice https://github.com/python/mypy/issues/1927 + overloads
2020-05-25 01:23:30 +01:00
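Modulo the overload machinery from the linked mypy issue, the runtime part is roughly this (a sketch, not the exact decorator):

    import warnings
    from functools import wraps
    from typing import Callable, Iterable, Iterator, TypeVar

    T = TypeVar('T')

    def warn_if_empty(f: Callable[..., Iterable[T]]) -> Callable[..., Iterator[T]]:
        @wraps(f)
        def wrapped(*args, **kwargs) -> Iterator[T]:
            empty = True
            for item in f(*args, **kwargs):
                empty = False
                yield item
            if empty:
                warnings.warn(f"'{f.__name__}' didn't emit any data!")
        return wrapped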
Dima Gerasimov
216944b3cd core: improvements for warnings, twitter/rss: try using @warn_if_empty 2020-05-25 00:56:03 +01:00
Dima Gerasimov
616ffb457e core: use overloads to type @warn_if_empty properly 2020-05-25 00:25:33 +01:00
Dima Gerasimov
e3a71ea6c6 my.core: more work on typing @warn_if_empty, extra test 2020-05-25 00:25:33 +01:00
Dima Gerasimov
4b22d17188 core: add @warn_if_empty decorator 2020-05-25 00:25:33 +01:00
karlicoss
af814df8e9
Merge pull request #53 from karlicoss/upd
make my.twitter.all easier to override
2020-05-24 23:02:57 +01:00
Dima Gerasimov
f5267d05d7 my.twitter.archive: rename config (preserving backward compatibility for now) 2020-05-24 13:06:52 +01:00
Dima Gerasimov
b99b2f3cfa core: add warning when get_files returns no files, my.twitter.archive: make more defensive in case of no archives 2020-05-24 12:51:23 +01:00
Dima Gerasimov
b7662378a2 docs: minor updates 2020-05-22 19:38:14 +01:00
Dima Gerasimov
03773a7b2c twitter module: prettify top level twitter.all 2020-05-22 19:00:02 +01:00
karlicoss
c410daa484
Merge pull request #52 from karlicoss/updates
Updates
2020-05-18 23:40:58 +01:00
Dima Gerasimov
02ba71a91d documentation: generate tables of content, better navigation 2020-05-18 23:31:55 +01:00
Dima Gerasimov
c8bdbfd69f core: expand '~' in get_files & import_dir 2020-05-18 22:43:27 +01:00
Dima Gerasimov
403ec18385 core/modules: get rid of set_repo uses, it was just complicating everything 2020-05-18 21:33:52 +01:00
Dima Gerasimov
0f80e9d5e6 ok, seems that import_dir is a bit saner 2020-05-18 21:04:38 +01:00
Dima Gerasimov
44aa062756 tests: thinking about external repositories 2020-05-18 20:42:10 +01:00
karlicoss
41c5b34006
Merge pull request #51 from karlicoss/updates
Improve documentation for some modules
2020-05-17 22:10:58 +01:00
Dima Gerasimov
c0bbb4eaf2 misc: get rid of SimpleNamespace uses 2020-05-17 22:05:23 +01:00
Dima Gerasimov
2a9fd54c12 Improve documentation for some modules 2020-05-17 21:56:58 +01:00
karlicoss
c07ea0a600
Merge pull request #50 from karlicoss/polar
polar module updates
2020-05-17 14:01:49 +01:00
Dima Gerasimov
65138808e7 polar: handle few more attributes defensively 2020-05-15 13:17:02 +01:00
Dima Gerasimov
8277b33c18 polar: add highlight colors 2020-05-15 12:52:22 +01:00
Dima Gerasimov
3d8002c8c9 polar: support configuring defensive behaviour, support for highlight tags 2020-05-15 12:40:15 +01:00
Dima Gerasimov
844ebf28c1 polar: extract book tags 2020-05-15 11:49:30 +01:00
Dima Gerasimov
759b0e1324 polar: expose a proper filename 2020-05-15 10:11:09 +01:00
Dima Gerasimov
87ad9d38bb polar: add test for orger integration 2020-05-15 09:52:18 +01:00
Dima Gerasimov
0f27071dcc polar: minor improvements, konsume: more type annotations 2020-05-15 09:07:23 +01:00
Dima Gerasimov
f3d5064ff2 polar: allow properly specifying polar_dir, with ~ as a default 2020-05-15 08:18:47 +01:00
Dima Gerasimov
8f86d7706b core: use appdirs for ~/.config detection 2020-05-15 08:18:47 +01:00
Dima Gerasimov
b2b7eee480 polar: add test against custom public repos 2020-05-15 07:42:21 +01:00
Dima Gerasimov
647b6087dd add main HPI executable 2020-05-14 23:01:50 +01:00
Dima Gerasimov
6235e6ffae Make my.core a proper package (for brevity purposes) 2020-05-14 23:01:50 +01:00
Dima Gerasimov
8d998146e2 remove garbage org files, move example config down the hierarchy 2020-05-14 23:01:50 +01:00
karlicoss
d0427855e8
Merge pull request #48 from karlicoss/configuration
lastfmupdates: docs, lastfm, rss module
2020-05-13 23:07:50 +01:00
Dima Gerasimov
63d4198fd9 rss module: prettify & reorganize to allow for easily adding extra modules 2020-05-13 22:58:09 +01:00
Dima Gerasimov
92cf375480 move rss stuff in a separate subpackage 2020-05-13 22:58:09 +01:00
Dima Gerasimov
c289fbb872 rss: minor enhancements 2020-05-13 22:58:09 +01:00
Dima Gerasimov
eba2d26b31 Update lastfm order/tests/docs 2020-05-13 22:52:23 +01:00
Dima Gerasimov
522bfff679 update configuration doc & more tests 2020-05-13 22:52:23 +01:00
Dima Gerasimov
cda6bd51ce add py37 compatibility helper for datetime.fromisoformat 2020-05-13 22:52:23 +01:00
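A sketch of what such a helper can look like (assuming only the plain 'YYYY-MM-DDTHH:MM:SS[.ffffff]' shape is needed; not the exact compat code):

    import sys
    from datetime import datetime

    if sys.version_info[:2] >= (3, 7):
        fromisoformat = datetime.fromisoformat
    else:
        def fromisoformat(s: str) -> datetime:
            # crude py<3.7 fallback; doesn't handle timezone offsets
            fmt = '%Y-%m-%dT%H:%M:%S.%f' if '.' in s else '%Y-%m-%dT%H:%M:%S'
            return datetime.strptime(s, fmt)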
karlicoss
1e6e0bd381
Merge pull request #47 from karlicoss/more-documentation
More documentation & tests
2020-05-10 22:48:19 +01:00
Dima Gerasimov
d7abff03fc add dataclasses dependency for python<3.7 2020-05-10 22:43:02 +01:00
Dima Gerasimov
1d0ef82d32 Add test demonstrating unloading the module during dynamic configuration 2020-05-10 21:57:55 +01:00
Dima Gerasimov
0ac78143f2 add my.demo for testing out various approaches to configuring 2020-05-10 21:32:48 +01:00
karlicoss
d6f071e3b1
Merge pull request #45 from karlicoss/better-configs
Better configs: safer and self documented
2020-05-10 18:11:52 +01:00
Dima Gerasimov
976b3da6f4 Autoextract documentation for some modules, improve docs 2020-05-10 18:09:12 +01:00
Dima Gerasimov
9cb39103c6 start autogenerating documentation on modules 2020-05-10 16:42:40 +01:00
Dima Gerasimov
e92ca215e3 Adapt takeout and twitter configs to the new pattern
Works fairly well so far?
2020-05-10 15:56:57 +01:00
Dima Gerasimov
8cbbafae1d extract dataclass-based config helper 2020-05-10 15:18:45 +01:00
Dima Gerasimov
217116dfe9 Use @dataclass with reddit, seems to work well 2020-05-10 14:47:02 +01:00
Dima Gerasimov
051cbe3e38 update config documentation even more 2020-05-10 13:27:25 +01:00
Dima Gerasimov
9206366184 more requirements for the configuration 2020-05-10 12:05:36 +01:00
Dima Gerasimov
08dffac7b4 explain some rationales about the config format 2020-05-10 10:34:50 +01:00
Dima Gerasimov
5fd5b91b92 Try the NamedTuple approach for google takeouts 2020-05-09 23:32:30 +01:00
Dima Gerasimov
c877104b90 another attempt to make the configs more self-documenting: via NamedTuple 2020-05-09 23:17:44 +01:00
Dima Gerasimov
4b8c2d4be4 Inherit from the base config 2020-05-09 22:21:15 +01:00
Dima Gerasimov
90b9d1d9c1 Use Protocol for proper config documentation 2020-05-09 21:38:25 +01:00
Dima Gerasimov
c75747f371 improving config documentation and allowing for fallbacks 2020-05-09 20:06:07 +01:00
Dima Gerasimov
66453cb29b add @classproperty, change set_repo to not require the parent 2020-05-09 20:03:35 +01:00
Dima Gerasimov
5f4acfddee add takeout example 2020-05-08 16:57:59 +01:00
Dima Gerasimov
505a2b22ae fix my.config.repos stub 2020-05-07 08:39:30 +01:00
karlicoss
40b6a82b7c
Merge pull request #42 from karlicoss/updates
cleanup, move stuff to my.core, update docs
2020-05-06 23:23:41 +01:00
Dima Gerasimov
d4a430e12e update dev docs 2020-05-06 23:21:29 +01:00
Dima Gerasimov
6ecb953675 cleanup, mypy coverage & add common/error stubs 2020-05-06 22:54:14 +01:00
Dima Gerasimov
15444c7b1f move common/error to my.core 2020-05-06 22:36:29 +01:00
Dima Gerasimov
eb97021b8e improve lint script, explore subpackages 2020-05-06 22:28:37 +01:00
Dima Gerasimov
b7e5640f35 move init.py to my.core 2020-05-06 22:20:00 +01:00
Dima Gerasimov
9d5d368891 get rid of unnecessary .init imports 2020-05-06 22:05:16 +01:00
Dima Gerasimov
069732600c cleanup for reddit data provider 2020-05-06 08:09:20 +01:00
Dima Gerasimov
5d3c0bdb1f update with_my script, use correct order of arguments 2020-05-05 22:22:32 +01:00
Dima Gerasimov
6d1fba2171 Extra test for MY_CONFIG variable; fix order import for stub/dynamic config 2020-05-05 22:22:32 +01:00
Dima Gerasimov
636060db57 Simplify config discovery: get rid of the hacky stub and reimport proper config automatically 2020-05-05 22:22:32 +01:00
Dima Gerasimov
fd224d8c38 add test for config.set_repo 2020-05-04 22:08:58 +01:00
Dima Gerasimov
4cceccd787 add test for dynamic config attributes (import my.cfg as config) 2020-05-04 22:08:58 +01:00
Dima Gerasimov
fe763c3c04 Fix my.config handling during mypy 2020-05-04 19:52:18 +01:00
Dima Gerasimov
1f07e1a2a8 enable mypy on CI for core stuff 2020-05-04 19:52:18 +01:00
Dima Gerasimov
3912ef2460 fix zstd handling and github wrapper 2020-05-04 19:52:18 +01:00
karlicoss
77d557e172
Merge pull request #38 from karlicoss/updates
More uniform handling for compressed files
2020-05-04 08:57:48 +01:00
Dima Gerasimov
55ac85c7e7 cpath tests, rely more on it 2020-05-04 08:53:41 +01:00
Dima Gerasimov
8b8a85e8c3 kompress.kopen improvements
- tests
- uniform handling for bytes/str, always return utf8 str by default
2020-05-04 08:37:36 +01:00
Dima Gerasimov
c3a77b6256 initial kompress tests 2020-05-04 07:50:29 +01:00
Dima Gerasimov
db47ba2d7e Move get_files tests to separate file 2020-05-04 07:17:20 +01:00
karlicoss
5aecc037e9
Merge pull request #37 from karlicoss/updates
various updates: implicit globs for get-files, mcachew type checking, modules cleanup
2020-05-03 17:19:55 +01:00
Dima Gerasimov
0b61dd9e42 more minor tweaks, benefit from get_files 2020-05-03 17:15:51 +01:00
Dima Gerasimov
9bd61940b8 rely on implicit glob for my.reddit 2020-05-03 16:56:05 +01:00
Dima Gerasimov
5706f690e7 support implicit globs! 2020-05-03 16:52:09 +01:00
Dima Gerasimov
c2961cb1cf properly test get_files 2020-05-03 16:30:57 +01:00
Dima Gerasimov
5c6eec62ee start testing get_files 2020-05-03 16:17:48 +01:00
Dima Gerasimov
19e90eb647 improvements to @mcachew type checking 2020-05-03 15:57:11 +01:00
Dima Gerasimov
78dbbd3c55 prettify emfit provider 2020-05-03 13:42:31 +01:00
Dima Gerasimov
2bf62e2db3 fix photo link 2020-05-03 12:26:18 +01:00
Dima Gerasimov
22e2d68e5d cleanup hypothesis module 2020-05-03 10:27:58 +01:00
Dima Gerasimov
a521885aa0 prettify github extractors 2020-05-03 10:08:53 +01:00
Dima Gerasimov
4244f403ed simplify instapaper module 2020-05-03 08:22:15 +01:00
Dima Gerasimov
81ca1e2c25 macos ci 2020-04-26 16:57:30 +01:00
Dima Gerasimov
526e7d3fa9 update documentation of private config 2020-04-26 16:50:06 +01:00
Dima Gerasimov
37842ea45c update readme to reflect blog changes 2020-04-26 16:50:06 +01:00
Dima Gerasimov
51ae8601b4 Update docstrings and add links 2020-04-26 16:50:06 +01:00
Dima Gerasimov
cd3f2996a3 port tests for takeout from kython 2020-04-26 16:50:06 +01:00
Dima Gerasimov
04605d9c09 attempt to use post versions 2020-04-24 22:01:15 +01:00
Dima Gerasimov
4e861eb2b2 fix namespace packages finding 2020-04-24 21:44:03 +01:00
karlicoss
0d4bcc1d7c
Merge pull request #33 from karlicoss/updates
Google takeout updates
2020-04-24 18:55:54 +01:00
Dima Gerasimov
a84b51807f move takeout to a separate subpackage 2020-04-24 18:10:33 +01:00
Dima Gerasimov
d1aa4d19dc get rid of callbacks in takeout processing interface 2020-04-24 17:34:56 +01:00
Dima Gerasimov
810fe21839 attempt to use xmllint to speed up takeout parsing 2020-04-24 16:35:20 +01:00
Dima Gerasimov
adadffef16 add takeout parser test 2020-04-24 16:11:19 +01:00
Dima Gerasimov
60ccca52ad more takeout tweaks and comments 2020-04-24 15:57:44 +01:00
Dima Gerasimov
21e82f0cd6 add disable_cachew helper 2020-04-24 15:19:31 +01:00
Dima Gerasimov
121ed58c17 add pytest config, add hack for reddit tests 2020-04-21 19:25:20 +01:00
Dima Gerasimov
bc0794cc37 add traverse() to roam 2020-04-21 19:25:20 +01:00
Dima Gerasimov
96a850faf9 remove unnecessary methods from twitter provider 2020-04-20 08:38:01 +01:00
karlicoss
bfe3165f45
Merge pull request #31 from karlicoss/cpath-windows
Portable CPath
2020-04-19 22:16:16 +01:00
Dima Gerasimov
ab61a95701 specify encoding for uncompressed files in kompress.kopen 2020-04-19 21:50:46 +01:00
Dima Gerasimov
4cd6df86cf Portable CPath
fixes https://github.com/karlicoss/HPI/issues/28
2020-04-19 21:01:56 +01:00
karlicoss
b911796c15
Merge pull request #26 from karlicoss/roam
Roam Research module
2020-04-19 17:59:12 +01:00
Dima Gerasimov
caabe4a3c8 fix test pip deployments 2020-04-19 17:58:17 +01:00
Dima Gerasimov
39860862ae rename nodes -> notes 2020-04-19 17:55:15 +01:00
Dima Gerasimov
4a83ff4864 hacks to work around python3.6 imports; add set_repo 2020-04-19 12:42:00 +01:00
Dima Gerasimov
d0fd6f822a split out path from permalink 2020-04-19 00:55:18 +01:00
Dima Gerasimov
7b9266b25d created date: add fallback for missing/unexpected title format 2020-04-18 21:16:07 +01:00
Dima Gerasimov
72d5616898 add support for permalinks and guess created time for daily notes 2020-04-18 21:16:07 +01:00
Dima Gerasimov
575de57fb6 Initial data provider for roam research 2020-04-18 21:16:07 +01:00
karlicoss
e884d90ea0
Merge pull request #25 from karlicoss/updates
setup guide updates
2020-04-18 16:27:20 +01:00
Dima Gerasimov
cb8bba4e66 reformat setup.org 2020-04-18 16:25:45 +01:00
Dima Gerasimov
185fa9aabd update readme 2020-04-18 16:16:54 +01:00
306 changed files with 24687 additions and 4305 deletions

66
.ci/release Executable file
View file

@ -0,0 +1,66 @@
#!/usr/bin/env python3
'''
Run [[file:scripts/release][scripts/release]] to deploy Python package onto [[https://pypi.org][PyPi]] and [[https://test.pypi.org][test PyPi]].

The script expects =TWINE_PASSWORD= environment variable to contain the [[https://pypi.org/help/#apitoken][PyPi token]] (not the password!).

The script can be run manually.
It's also running as =pypi= job in [[file:.github/workflows/main.yml][Github Actions config]]. Packages are deployed on:
- every master commit, onto test pypi
- every new tag, onto production pypi

You'll need to set =TWINE_PASSWORD= and =TWINE_PASSWORD_TEST= in [[https://help.github.com/en/actions/configuring-and-managing-workflows/creating-and-storing-encrypted-secrets#creating-encrypted-secrets][secrets]]
for Github Actions deployment to work.
'''
import os
import sys
from pathlib import Path
from subprocess import check_call
import shutil

is_ci = os.environ.get('CI') is not None

def main() -> None:
    import argparse
    p = argparse.ArgumentParser()
    p.add_argument('--test', action='store_true', help='use test pypi')
    args = p.parse_args()

    extra = []
    if args.test:
        extra.extend(['--repository', 'testpypi'])

    root = Path(__file__).absolute().parent.parent
    os.chdir(root)  # just in case

    if is_ci:
        # see https://github.com/actions/checkout/issues/217
        check_call('git fetch --prune --unshallow'.split())

    dist = root / 'dist'
    if dist.exists():
        shutil.rmtree(dist)

    check_call(['python3', '-m', 'build'])

    TP = 'TWINE_PASSWORD'
    password = os.environ.get(TP)
    if password is None:
        print(f"WARNING: no {TP} passed", file=sys.stderr)
        import pip_secrets
        password = pip_secrets.token_test if args.test else pip_secrets.token  # meh

    check_call([
        'python3', '-m', 'twine',
        'upload', *dist.iterdir(),
        *extra,
    ], env={
        'TWINE_USERNAME': '__token__',
        TP: password,
        **os.environ,
    })

if __name__ == '__main__':
    main()

48
.ci/run Executable file
View file

@ -0,0 +1,48 @@
#!/bin/bash
set -eu

cd "$(dirname "$0")"
cd .. # git root

if ! command -v sudo; then
    # CI or Docker sometimes doesn't have it, so useful to have a dummy
    function sudo {
        "$@"
    }
fi

# --parallel-live to show outputs while it's running
tox_cmd='run-parallel --parallel-live'
if [ -n "${CI-}" ]; then
    # install OS specific stuff here
    case "$OSTYPE" in
    darwin*)
        # macos
        brew install fd
        ;;
    cygwin* | msys* | win*)
        # windows
        # ugh. parallel stuff seems super flaky under windows, some random failures, "file used by other process" and crap like that
        tox_cmd='run'
        ;;
    *)
        # must be linux?
        sudo apt update
        sudo apt install fd-find
        ;;
    esac
fi

PY_BIN="python3"
# some systems might have python pointing to python3
if ! command -v python3 &> /dev/null; then
    PY_BIN="python"
fi

# TODO hmm for some reason installing uv with pip and then running
# "$PY_BIN" -m uv tool fails with missing setuptools error??
# just uvx directly works, but it's not present in PATH...
"$PY_BIN" -m pip install --user pipx
"$PY_BIN" -m pipx run uv tool run --with=tox-uv tox $tox_cmd "$@"

.github/workflows/main.yml
View file

@ -1,63 +1,106 @@
 # see https://github.com/karlicoss/pymplate for up-to-date reference
 name: CI
-on: [push]
+on:
+  push:
+    branches: '*'
+    tags: 'v[0-9]+.*' # only trigger on 'release' tags for PyPi
+    # Ideally I would put this in the pypi job... but github syntax doesn't allow for regexes there :shrug:
+  pull_request: # needed to trigger on others' PRs
+  # Note that people who fork it need to go to "Actions" tab on their fork and click "I understand my workflows, go ahead and enable them".
+  workflow_dispatch: # needed to trigger workflows manually
+    # todo cron?
+    inputs:
+      debug_enabled:
+        type: boolean
+        description: 'Run the build with tmate debugging enabled (https://github.com/marketplace/actions/debugging-with-tmate)'
+        required: false
+        default: false

 jobs:
   build:
-    runs-on: ubuntu-latest
     strategy:
+      fail-fast: false
       matrix:
-        python-version: [3.6, 3.7, 3.8]
-        # TODO shit. matrix is going to prevent from twine deployments because of version conflicts??
-        # add 'and' clause??
+        platform: [ubuntu-latest, macos-latest, windows-latest]
+        python-version: ['3.9', '3.10', '3.11', '3.12', '3.13']
+        exclude: [
+          # windows runners are pretty scarce, so let's only run lowest and highest python version
+          {platform: windows-latest, python-version: '3.10'},
+          {platform: windows-latest, python-version: '3.11'},
+          {platform: windows-latest, python-version: '3.12'},
+          # same, macos is a bit too slow and ubuntu covers python quirks well
+          {platform: macos-latest , python-version: '3.10' },
+          {platform: macos-latest , python-version: '3.11' },
+          {platform: macos-latest , python-version: '3.12' },
+        ]
+    runs-on: ${{ matrix.platform }}
+    # useful for 'optional' pipelines
+    # continue-on-error: ${{ matrix.platform == 'windows-latest' }}

     steps:
-    # fuck me. https://help.github.com/en/actions/reference/workflow-commands-for-github-actions#adding-a-system-path
-    - run: echo "::add-path::$HOME/.local/bin"
+    # ugh https://github.com/actions/toolkit/blob/main/docs/commands.md#path-manipulation
+    - run: echo "$HOME/.local/bin" >> $GITHUB_PATH

-    - uses: actions/setup-python@v1
+    - uses: actions/setup-python@v5
       with:
         python-version: ${{ matrix.python-version }}

-    - uses: actions/checkout@v2
+    - uses: actions/checkout@v4
       with:
         submodules: recursive
+        fetch-depth: 0 # nicer to have all git history when debugging/for tests

-    # uncomment for SSH debugging
-    # - uses: mxschmitt/action-tmate@v2
+    - uses: mxschmitt/action-tmate@v3
+      if: ${{ github.event_name == 'workflow_dispatch' && inputs.debug_enabled }}

-    - run: scripts/ci/run
+    # explicit bash command is necessary for Windows CI runner, otherwise it thinks it's cmd...
+    - run: bash .ci/run

+    - if: matrix.platform == 'ubuntu-latest' # no need to compute coverage for other platforms
+      uses: actions/upload-artifact@v4
+      with:
+        include-hidden-files: true
+        name: .coverage.mypy-misc_${{ matrix.platform }}_${{ matrix.python-version }}
+        path: .coverage.mypy-misc/
+    - if: matrix.platform == 'ubuntu-latest' # no need to compute coverage for other platforms
+      uses: actions/upload-artifact@v4
+      with:
+        include-hidden-files: true
+        name: .coverage.mypy-core_${{ matrix.platform }}_${{ matrix.python-version }}
+        path: .coverage.mypy-core/

   pypi:
     runs-on: ubuntu-latest
-    needs: build
+    needs: [build] # add all other jobs here

     steps:
-    - run: echo "::add-path::$HOME/.local/bin"
+    # ugh https://github.com/actions/toolkit/blob/main/docs/commands.md#path-manipulation
+    - run: echo "$HOME/.local/bin" >> $GITHUB_PATH

-    - uses: actions/setup-python@v1
+    - uses: actions/setup-python@v5
       with:
-        python-version: 3.7
+        python-version: '3.10'

-    - uses: actions/checkout@v2
+    - uses: actions/checkout@v4
       with:
         submodules: recursive

     - name: 'release to test pypi'
       # always deploy merged master to test pypi
-      if: github.event.ref == 'refs/heads/master'
+      if: github.event_name != 'pull_request' && github.event.ref == 'refs/heads/master'
       env:
         TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD_TEST }}
-      run: pip3 install --user wheel twine && scripts/release --test
+      run: pip3 install --user --upgrade build twine && .ci/release --test

     - name: 'release to pypi'
       # always deploy tags to release pypi
-      # TODO filter release tags only?
-      if: startsWith(github.event.ref, 'refs/tags')
+      # NOTE: release tags are guarded by on: push: tags on the top
+      if: github.event_name != 'pull_request' && startsWith(github.event.ref, 'refs/tags')
       env:
         TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
-      run: pip3 install --user wheel twine && scripts/release
+      run: pip3 install --user --upgrade build twine && .ci/release
+
+# todo generate mypy coverage artifacts?

7
.gitignore vendored
View file

@ -12,6 +12,7 @@
 auto-save-list
 tramp
 .\#*
+*.gpx

 # Org-mode
 .org-id-locations
@ -154,7 +155,13 @@ celerybeat-schedule
 .dmypy.json
 dmypy.json

+# linters
+.ruff_cache/
+
 # Pyre type checker
 .pyre/

 # End of https://www.gitignore.io/api/python,emacs
+
+cov/
+*.png

6
.gitmodules vendored Normal file
View file

@ -0,0 +1,6 @@
[submodule "testdata/hpi-testdata"]
path = testdata/hpi-testdata
url = https://github.com/karlicoss/hpi-testdata
[submodule "testdata/track"]
path = testdata/track
url = https://github.com/tajtiattila/track

View file

@ -1 +0,0 @@
- /.mypy_cache/

58
CHANGELOG.md Normal file
View file

@ -0,0 +1,58 @@
# `v0.3.20210220`
General/my.core changes:
- a3305677b24694391a247fc4cb6cc1237e57f840 **deprecate** my.cfg, instead my.config can (and should be) used directly
- 0534c5c57dc420f9a01387b58a7098823e54277e new cli feature: **module management**
cli: add `hpi module install` and `hpi module requires`
relevant: https://github.com/karlicoss/HPI/issues/12, https://github.com/karlicoss/HPI/issues/79
- 97650adf3b48c653651b31c78cefe24ecae5ed4f add discovery_pure module to get modules and their dependencies via `ast` module
- f90599d7e4463e936c8d95196ff767c730207202 make module discovery rely on `ast` module
Hopefully it will make it more robust & much faster.
- 07f901e1e5fb2bd3009561c84cc4efd311c94733 helpers for **automatic dataframes** from sequences of NamedTuple/dataclass
- 4012f9b7c2a429170df8600591ec8d1e1407b162 more generic functions to jsonify data
- 746c3da0cadcba3b179688783186d8a0bd0999c5 core.pandas: allow specifying schema; add tests
- 5313984d8fea2b6eef6726b7b346c1f4316acd01 add `tmp_config` context manager for test & adhoc patching
- df9a7f7390aee6c69f1abf1c8d1fc7659ebb957c core.pandas: add check for 'error' column + add empty one by default
- e81dddddf083ffd81aa7e2b715bd34f59949479c properly resolve class properties in make_config + add test
Modules:
- some initial work on filling **InfluxDB** with HPI data
- pinboard
- 42399f6250d9901d93dcedcfe05f7857babcf834: **breaking backwards compatibility**, use pinbexport module directly
Use 'hpi module install my.pinboard' to install it
relevant: https://github.com/karlicoss/HPI/issues/79
- stackexchange
- 63c825ab81bb561e912655e423c6b332fb6fd1b4 use GDPR data for votes
- ddea816a49f5da79fd6332e7f6b879b1955838af use proper pip package, add stat
- bluemaestro
- 6d9bc2964b24cfe6187945f4634940673dfe9c27 populate grafana
- 1899b006de349140303110ca98a21d918d9eb049 investigation of data quality + more sanity checks
- d77ab92d8634d0863d2b966cb448bbfcc8a8d565 get rid of unnecessary file, move to top level
- runnerup
- 6b451336ed5df2b893c9e6387175edba50b0719b Initial parser for RunnerUp data which I'm now using instead of Endomondo
Misc:
- f102101b3917e8a38511faa5e4fd9dd33d284d7e core/windows: fix get_files and its tests
- 56d5587c209dcbd27c7802d60c0bc8e8e2391672 CI: clean up tox config a bit, get rid of custom lint script
- d562f00dca720fd4f6736377a41168e9a796c122
tests: run all tests, but exclude tests specific to my computer from CI
controllable via `HPI_TESTS_KARLICOSS=true`
- improved mypy coverage
# before `v0.2.20201125`
I used to keep it in [Github releases](https://github.com/karlicoss/HPI/releases).
However I realized it means promoting a silo, so now it's reflected in this file (and only copied to the github releases page).

README.org

@@ -1,15 +1,24 @@
 # TODO ugh. my blog generator dumps links as file: ....
 # so used something like :s/file:\(.*\)\.org/https:\/\/beepb00p.xyz\/\1.html/gc -- still leaves ::# links etc. ugh
 #+summary: My life in a Python package
+#+created: [2019-11-14 Thu]
 #+filetags: :infra:pkm:quantifiedself:hpi:
-#+upid: mypkg
+#+upid: hpi
 #+macro: map @@html:<span style='color:darkgreen; font-weight: bolder'>@@$1@@html:</span>@@
-#+macro: extraid @@html:<span style='visibility:hidden' id="$1"></span>@@
+If you're in a hurry, feel free to jump straight to the [[#usecases][demos]].
+- see [[https://github.com/karlicoss/HPI/tree/master/doc/SETUP.org][SETUP]] for the *installation/configuration guide*
+- see [[https://github.com/karlicoss/HPI/tree/master/doc/DEVELOPMENT.org][DEVELOPMENT]] for the *development guide*
+- see [[https://github.com/karlicoss/HPI/tree/master/doc/DESIGN.org][DESIGN]] for the *design goals*
+- see [[https://github.com/karlicoss/HPI/tree/master/doc/MODULES.org][MODULES]] for *module-specific setup*
+- see [[https://github.com/karlicoss/HPI/tree/master/doc/MODULE_DESIGN.org][MODULE_DESIGN]] for some thoughts on structuring modules, and possibly *extending HPI*
+- see [[https://beepb00p.xyz/exobrain/projects/hpi.html][exobrain/HPI]] for some of my raw thoughts and todos on the project
 *TLDR*: I'm using [[https://github.com/karlicoss/HPI][HPI]] (Human Programming Interface) package as a means of unifying, accessing and interacting with all of my personal data.
-It's a Python library (named ~my~), a collection of modules for:
+HPI is a Python package (named ~my~), a collection of modules for:
 - social networks: posts, comments, favorites
 - reading: e-books and pdfs
@@ -27,12 +36,11 @@ You simply 'import' your data and get to work with familiar Python types and dat
 - Here's a short example to give you an idea: "which subreddits I find the most interesting?"
 #+begin_src python
-import my.reddit
+import my.reddit.all
 from collections import Counter
-return Counter(s.subreddit for s in my.reddit.saved()).most_common(4)
+return Counter(s.subreddit for s in my.reddit.all.saved()).most_common(4)
 #+end_src
 | orgmode | 62 |
 | emacs | 60 |
 | selfhosted | 51 |
@@ -40,10 +48,10 @@ You simply 'import' your data and get to work with familiar Python types and dat
 I consider my digital trace an important part of my identity. ([[https://beepb00p.xyz/tags.html#extendedmind][#extendedmind]])
-The fact that the data is siloed, and accessing it is inconvenient and borderline frustrating feels very wrong.
+Usually the data is siloed, accessing it is inconvenient and borderline frustrating. This feels very wrong.
-Once the data is available as Python objects, I can easily plug it into existing tools, libraries and frameworks.
+In contrast, once the data is available as Python objects, I can easily plug it into existing tools, libraries and frameworks.
-It makes building new tools considerably easier and allows creating new ways of interacting with the data.
+It makes building new tools considerably easier and opens up new ways of interacting with the data.
 I tried different things over the years and I think I'm getting to the point where other people can also benefit from my code by 'just' plugging in their data,
 and that's why I'm sharing this.
@@ -51,10 +59,6 @@ and that's why I'm sharing this.
 Imagine if all your life was reflected digitally and available at your fingertips.
 This library is my attempt to achieve this vision.
-If you're in a hurry, feel free to jump straight to the [[#usecases][demos]].
-For *installation/configuration/development guide*, see [[https://github.com/karlicoss/HPI/tree/master/doc/SETUP.org][SETUP.org]].
 #+toc: headlines 2
@@ -72,6 +76,8 @@ For *installation/configuration/development guide*, see [[https://github.com/kar
 - Accessing exercise data
 - Book reading progress
 - Messenger stats
+- Which month in 2020 did I make the most git commits in?
+- Querying Roam Research database
 - How does it get input data?
 - Q & A
 - Why Python?
@@ -81,11 +87,16 @@ For *installation/configuration/development guide*, see [[https://github.com/kar
 - But /should/ I use it?
 - Would it suit /me/?
 - What it isn't?
+- HPI Repositories
 - Related links
 - --
 :END:
 * Why?
+:PROPERTIES:
+:CUSTOM_ID: motivation
+:END:
 The main reason that led me to develop this is the dissatisfaction of the current situation:
 - Our personal data is siloed and trapped across cloud services and various devices
@@ -96,7 +107,7 @@ The main reason that led me to develop this is the dissatisfaction of the curren
 Integrations of data across silo boundaries are almost non-existent. There is so much potential and it's all wasted.
-- I'm not willing to wait till some vaporwave project reinvents the whole computing model from scratch
+- I'm not willing to wait till some vaporware project reinvents the whole computing model from scratch
 As a programmer, I am in capacity to do something *right now*, even though it's not necessarily perfect and consistent.
@@ -176,15 +187,22 @@ But the major reason I want to solve these problems is to be better at learning
 so I could be better at solving the real problems.
 * How does a Python package help?
+:PROPERTIES:
+:CUSTOM_ID: package
+:END:
 When I started solving some of these problems for myself, I've noticed a common pattern: the [[https://beepb00p.xyz/sad-infra.html#exports_are_hard][hardest bit]] is actually getting your data in the first place.
 It's inherently error-prone and frustrating.
-But once you have the data in a convenient representation, working with it is pleasant -- you get to explore and build instead of fighting with yet another stupid REST API.
+But once you have the data in a convenient representation, working with it is pleasant -- you get to *explore and build instead of fighting with yet another stupid REST API*.
-This python package knows how to find data, deserialize it and normalize it to the convenient representation.
+This package knows how to find data on your filesystem, deserialize it and normalize it to a convenient representation.
 You have the full power of the programming language to transform the data and do whatever comes to your mind.
 ** Why don't you just put everything in a massive database?
+:PROPERTIES:
+:CUSTOM_ID: database
+:END:
 Glad you've asked! I wrote a whole [[https://beepb00p.xyz/unnecessary-db.html][post]] about it.
 In short: while databases are efficient and easy to read from, often they aren't flexible enough to fit your data.
@@ -195,33 +213,61 @@ That's where a Python package comes in.
 * What's inside?
+:PROPERTIES:
+:CUSTOM_ID: modules
+:END:
-Here's an (incomplete) list of the modules in the public package:
+Here's the (incomplete) list of the modules:
 :results:
-| [[https://github.com/karlicoss/my/tree/master/my/bluemaestro][my.bluemaestro]] | [[https://bluemaestro.com/products/product-details/bluetooth-environmental-monitor-and-logger][Bluemaestro]] temperature/humidity/pressure monitor |
-| [[https://github.com/karlicoss/my/tree/master/my/body/blood.py][my.body.blood]] | Blood tracking |
-| [[https://github.com/karlicoss/my/tree/master/my/body/weight.py][my.body.weight]] | Weight data (manually logged) |
-| [[https://github.com/karlicoss/my/tree/master/my/books/kobo.py][my.books.kobo]] | Kobo e-ink reader: annotations and reading stats |
-| [[https://github.com/karlicoss/my/tree/master/my/calendar/holidays.py][my.calendar.holidays]] | Provides data on days off work (based on public holidays + manual inputs) |
-| [[https://github.com/karlicoss/my/tree/master/my/coding/commits.py][my.coding.commits]] | Git commits data: crawls filesystem |
-| [[https://github.com/karlicoss/my/tree/master/my/coding/github.py][my.coding.github]] | Github events and their metadata: comments/issues/pull requests |
-| [[https://github.com/karlicoss/my/tree/master/my/emfit][my.emfit]] | [[https://shop-eu.emfit.com/products/emfit-qs][Emfit QS]] sleep tracker |
-| [[https://github.com/karlicoss/my/tree/master/my/fbmessenger.py][my.fbmessenger]] | Module for Facebook Messenger messages |
-| [[https://github.com/karlicoss/my/tree/master/my/feedbin.py][my.feedbin]] | Module for Feedbin RSS reader |
-| [[https://github.com/karlicoss/my/tree/master/my/feedly.py][my.feedly]] | Module for Feedly RSS reader |
-| [[https://github.com/karlicoss/my/tree/master/my/hypothesis.py][my.hypothesis]] | Hypothes.is highlights and annotations |
-| [[https://github.com/karlicoss/my/tree/master/my/instapaper.py][my.instapaper]] | Instapaper bookmarks, highlights and annotations |
-| [[https://github.com/karlicoss/my/tree/master/my/location/takeout.py][my.location.takeout]] | Module for Google Takeout data |
-| [[https://github.com/karlicoss/my/tree/master/my/materialistic.py][my.materialistic]] | Module for [[https://play.google.com/store/apps/details?id=io.github.hidroh.materialistic][Materialistic]] app for Hackernews |
-| [[https://github.com/karlicoss/my/tree/master/my/notes/orgmode.py][my.notes.orgmode]] | Programmatic access and queries to org-mode files on the filesystem |
-| [[https://github.com/karlicoss/my/tree/master/my/photos][my.photos]] | Module for accessing photos and videos, with their GPS and timestamps |
-| [[https://github.com/karlicoss/my/tree/master/my/pinboard.py][my.pinboard]] | Module for pinboard.in bookmarks |
-| [[https://github.com/karlicoss/my/tree/master/my/reading/polar.py][my.reading.polar]] | Module for Polar articles and highlights |
-| [[https://github.com/karlicoss/my/tree/master/my/reddit.py][my.reddit]] | Module for Reddit data: saved items/comments/upvotes etc |
-| [[https://github.com/karlicoss/my/tree/master/my/rtm.py][my.rtm]] | [[https://rememberthemilk.com][Remember The Milk]] tasks and notes |
-| [[https://github.com/karlicoss/my/tree/master/my/smscalls.py][my.smscalls]] | Phone calls and SMS messages |
-| [[https://github.com/karlicoss/my/tree/master/my/twitter.py][my.twitter]] | Module for Twitter (uses official twitter archive export) |
+| [[https://github.com/karlicoss/HPI/tree/master/my/bluemaestro.py][=my.bluemaestro=]] | [[https://bluemaestro.com/products/product-details/bluetooth-environmental-monitor-and-logger][Bluemaestro]] temperature/humidity/pressure monitor |
+| [[https://github.com/karlicoss/HPI/tree/master/my/body/blood.py][=my.body.blood=]] | Blood tracking (manual org-mode entries) |
+| [[https://github.com/karlicoss/HPI/tree/master/my/body/exercise/all.py][=my.body.exercise.all=]] | Combined exercise data |
+| [[https://github.com/karlicoss/HPI/tree/master/my/body/exercise/cardio.py][=my.body.exercise.cardio=]] | Cardio data, filtered from various data sources |
+| [[https://github.com/karlicoss/HPI/tree/master/my/body/exercise/cross_trainer.py][=my.body.exercise.cross_trainer=]] | My cross trainer exercise data, arbitrated from different sources (mainly, Endomondo and manual text notes) |
+| [[https://github.com/karlicoss/HPI/tree/master/my/body/weight.py][=my.body.weight=]] | Weight data (manually logged) |
+| [[https://github.com/karlicoss/HPI/tree/master/my/calendar/holidays.py][=my.calendar.holidays=]] | Holidays and days off work |
+| [[https://github.com/karlicoss/HPI/tree/master/my/coding/commits.py][=my.coding.commits=]] | Git commits data for repositories on your filesystem |
+| [[https://github.com/karlicoss/HPI/tree/master/my/demo.py][=my.demo=]] | Just a demo module for testing and documentation purposes |
+| [[https://github.com/karlicoss/HPI/tree/master/my/emfit/__init__.py][=my.emfit=]] | [[https://shop-eu.emfit.com/products/emfit-qs][Emfit QS]] sleep tracker |
+| [[https://github.com/karlicoss/HPI/tree/master/my/endomondo.py][=my.endomondo=]] | Endomondo exercise data |
+| [[https://github.com/karlicoss/HPI/tree/master/my/fbmessenger.py][=my.fbmessenger=]] | Facebook Messenger messages |
+| [[https://github.com/karlicoss/HPI/tree/master/my/foursquare.py][=my.foursquare=]] | Foursquare/Swarm checkins |
+| [[https://github.com/karlicoss/HPI/tree/master/my/github/all.py][=my.github.all=]] | Unified Github data (merged from GDPR export and periodic API updates) |
+| [[https://github.com/karlicoss/HPI/tree/master/my/github/gdpr.py][=my.github.gdpr=]] | Github data (uses [[https://github.com/settings/admin][official GDPR export]]) |
+| [[https://github.com/karlicoss/HPI/tree/master/my/github/ghexport.py][=my.github.ghexport=]] | Github data: events, comments, etc. (API data) |
+| [[https://github.com/karlicoss/HPI/tree/master/my/hypothesis.py][=my.hypothesis=]] | [[https://hypothes.is][Hypothes.is]] highlights and annotations |
+| [[https://github.com/karlicoss/HPI/tree/master/my/instapaper.py][=my.instapaper=]] | [[https://www.instapaper.com][Instapaper]] bookmarks, highlights and annotations |
+| [[https://github.com/karlicoss/HPI/tree/master/my/kobo.py][=my.kobo=]] | [[https://uk.kobobooks.com/products/kobo-aura-one][Kobo]] e-ink reader: annotations and reading stats |
+| [[https://github.com/karlicoss/HPI/tree/master/my/lastfm.py][=my.lastfm=]] | Last.fm scrobbles |
+| [[https://github.com/karlicoss/HPI/tree/master/my/location/google.py][=my.location.google=]] | Location data from Google Takeout |
+| [[https://github.com/karlicoss/HPI/tree/master/my/location/home.py][=my.location.home=]] | Simple location provider, serving as a fallback when more detailed data isn't available |
+| [[https://github.com/karlicoss/HPI/tree/master/my/materialistic.py][=my.materialistic=]] | [[https://play.google.com/store/apps/details?id=io.github.hidroh.materialistic][Materialistic]] app for Hackernews |
+| [[https://github.com/karlicoss/HPI/tree/master/my/orgmode.py][=my.orgmode=]] | Programmatic access and queries to org-mode files on the filesystem |
+| [[https://github.com/karlicoss/HPI/tree/master/my/pdfs.py][=my.pdfs=]] | PDF documents and annotations on your filesystem |
+| [[https://github.com/karlicoss/HPI/tree/master/my/photos/main.py][=my.photos.main=]] | Photos and videos on your filesystem, their GPS and timestamps |
+| [[https://github.com/karlicoss/HPI/tree/master/my/pinboard.py][=my.pinboard=]] | [[https://pinboard.in][Pinboard]] bookmarks |
+| [[https://github.com/karlicoss/HPI/tree/master/my/pocket.py][=my.pocket=]] | [[https://getpocket.com][Pocket]] bookmarks and highlights |
+| [[https://github.com/karlicoss/HPI/tree/master/my/polar.py][=my.polar=]] | [[https://github.com/burtonator/polar-bookshelf][Polar]] articles and highlights |
+| [[https://github.com/karlicoss/HPI/tree/master/my/reddit.py][=my.reddit=]] | Reddit data: saved items/comments/upvotes/etc. |
+| [[https://github.com/karlicoss/HPI/tree/master/my/rescuetime.py][=my.rescuetime=]] | Rescuetime (phone activity tracking) data. |
+| [[https://github.com/karlicoss/HPI/tree/master/my/roamresearch.py][=my.roamresearch=]] | [[https://roamresearch.com][Roam]] data |
+| [[https://github.com/karlicoss/HPI/tree/master/my/rss/all.py][=my.rss.all=]] | Unified RSS data, merged from different services I used historically |
+| [[https://github.com/karlicoss/HPI/tree/master/my/rss/feedbin.py][=my.rss.feedbin=]] | Feedbin RSS reader |
+| [[https://github.com/karlicoss/HPI/tree/master/my/rss/feedly.py][=my.rss.feedly=]] | Feedly RSS reader |
+| [[https://github.com/karlicoss/HPI/tree/master/my/rtm.py][=my.rtm=]] | [[https://rememberthemilk.com][Remember The Milk]] tasks and notes |
+| [[https://github.com/karlicoss/HPI/tree/master/my/runnerup.py][=my.runnerup=]] | [[https://github.com/jonasoreland/runnerup][Runnerup]] exercise data (TCX format) |
+| [[https://github.com/karlicoss/HPI/tree/master/my/smscalls.py][=my.smscalls=]] | Phone calls and SMS messages |
+| [[https://github.com/karlicoss/HPI/tree/master/my/stackexchange/gdpr.py][=my.stackexchange.gdpr=]] | Stackexchange data (uses [[https://stackoverflow.com/legal/gdpr/request][official GDPR export]]) |
+| [[https://github.com/karlicoss/HPI/tree/master/my/stackexchange/stexport.py][=my.stackexchange.stexport=]] | Stackexchange data (uses API via [[https://github.com/karlicoss/stexport][stexport]]) |
+| [[https://github.com/karlicoss/HPI/tree/master/my/taplog.py][=my.taplog=]] | [[https://play.google.com/store/apps/details?id=com.waterbear.taglog][Taplog]] app data |
+| [[https://github.com/karlicoss/HPI/tree/master/my/time/tz/main.py][=my.time.tz.main=]] | Timezone data provider, used to localize timezone-unaware timestamps for other modules |
+| [[https://github.com/karlicoss/HPI/tree/master/my/time/tz/via_location.py][=my.time.tz.via_location=]] | Timezone data provider, guesses timezone based on location data (e.g. GPS) |
+| [[https://github.com/karlicoss/HPI/tree/master/my/twitter/all.py][=my.twitter.all=]] | Unified Twitter data (merged from the archive and periodic updates) |
+| [[https://github.com/karlicoss/HPI/tree/master/my/twitter/archive.py][=my.twitter.archive=]] | Twitter data (uses [[https://help.twitter.com/en/managing-your-account/how-to-download-your-twitter-archive][official twitter archive export]]) |
+| [[https://github.com/karlicoss/HPI/tree/master/my/twitter/twint.py][=my.twitter.twint=]] | Twitter data (tweets and favorites). Uses [[https://github.com/twintproject/twint][Twint]] data export. |
+| [[https://github.com/karlicoss/HPI/tree/master/my/vk/vk_messages_backup.py][=my.vk.vk_messages_backup=]] | VK data (exported by [[https://github.com/Totktonada/vk_messages_backup][Totktonada/vk_messages_backup]]) |
 :END:
 Some modules are private, and need a bit of cleanup before merging:
@@ -234,33 +280,61 @@ Some modules are private, and need a bit of cleanup before merging:
-#+html: <div id="usecases"><div>
+#+html: <div id="usecases"></div>
 * How do you use it?
+:PROPERTIES:
+:CUSTOM_ID: usecases
+:END:
 Mainly I use it as a data provider for my scripts, tools, and dashboards.
-Also, check out [[https://beepb00p.xyz/myinfra.html#mypkg][my infrastructure map]].
-It's a draft at the moment, but it might be helpful for understanding what's my vision on HPI.
+Also, check out [[https://beepb00p.xyz/myinfra.html#mypkg][my infrastructure map]]. It might be helpful for understanding what's my vision on HPI.
 ** Instant search
+:PROPERTIES:
+:CUSTOM_ID: search
+:END:
 Typical search interfaces make me unhappy as they are *siloed, slow, awkward to use and don't work offline*.
 So I built my own ways around it! I write about it in detail [[https://beepb00p.xyz/pkm-search.html#personal_information][here]].
 In essence, I'm mirroring most of my online data like chat logs, comments, etc., as plaintext.
 I can overview it in any text editor, and incrementally search over *all of it* in a single keypress.
 ** orger
+:PROPERTIES:
+:CUSTOM_ID: orger
+:END:
-[[https://github.com/karlicoss/orger][orger]] is a tool and set of modules for accessing data via org-mode.
-It allows searching and overviewing, and in addition, I'm using it for creating tasks straight from native app interfaces (e.g. Reddit/Telegram) and spaced repetition via [[https://orgmode.org/worg/org-contrib/org-drill.html][org-drill]].
-I write about it in detail [[https://beepb00p.xyz/orger.html][here]] and [[https://beepb00p.xyz/orger-todos.html][here]].
+[[https://github.com/karlicoss/orger][orger]] is a tool that helps you generate an org-mode representation of your data.
+It lets you benefit from the existing tooling and infrastructure around org-mode, the most famous being Emacs.
+I'm using it for:
+- searching, overviewing and navigating the data
+- creating tasks straight from the apps (e.g. Reddit/Telegram)
+- spaced repetition via [[https://orgmode.org/worg/org-contrib/org-drill.html][org-drill]]
+Orger comes with some existing [[https://github.com/karlicoss/orger/tree/master/modules][modules]], but it should be easy to adapt your own data source if you need something else.
+I write about it in detail [[http://beepb00p.xyz/orger.html][here]] and [[http://beepb00p.xyz/orger-todos.html][here]].
 ** promnesia
+:PROPERTIES:
+:CUSTOM_ID: promnesia
+:END:
 [[https://github.com/karlicoss/promnesia#demo][promnesia]] is a browser extension I'm working on to escape silos by *unifying annotations and browsing history* from different data sources.
 I've been using it for more than a year now and working on final touches to properly release it for other people.
 ** dashboard
+:PROPERTIES:
+:CUSTOM_ID: dashboard
+:END:
 As a big fan of [[https://beepb00p.xyz/tags.html#quantified-self][#quantified-self]], I'm working on personal health, sleep and exercise dashboard, built from various data sources.
 I'm working on making it public, you can see some screenshots [[https://www.reddit.com/r/QuantifiedSelf/comments/cokt4f/what_do_you_all_do_with_your_data/ewmucgk][here]].
 ** timeline
+:PROPERTIES:
+:CUSTOM_ID: timeline
+:END:
 Timeline is a [[https://beepb00p.xyz/tags.html#lifelogging][#lifelogging]] project I'm working on.
 I want to see all my digital history, search in it, filter, easily jump at a specific point in time and see the context when it happened.
@@ -270,15 +344,20 @@ Ideally, it would look similar to Andrew Louis's [[https://hyfen.net/memex][Meme
 he open sources it. I highly recommend watching his talk for inspiration.
 * Ad-hoc and interactive
+:PROPERTIES:
+:CUSTOM_ID: interactive
+:END:
 ** What were my music listening stats for 2018?
+:PROPERTIES:
+:CUSTOM_ID: lastfm
+:END:
 Single import away from getting tracks you listened to:
 #+begin_src python
-from my.lastfm import get_scrobbles
-scrobbles = get_scrobbles()
-scrobbles[200: 205]
+from my.lastfm import scrobbles
+list(scrobbles())[200: 205]
 #+end_src
@@ -289,16 +368,15 @@ Single import away from getting tracks you listened to:
 : Scrobble(raw={'album': 'Rolled Gold +', 'artist': 'The Rolling Stones', 'date': '1282494161', 'name': "You Can't Always Get What You Want"})]
-Or, as a pandas frame to make it pretty:
+Or, as a pretty Pandas frame:
 #+begin_src python
 import pandas as pd
 df = pd.DataFrame([{
     'dt': s.dt,
     'track': s.track,
-} for s in scrobbles])
-cdf = df.set_index('dt')
-cdf[200: 205]
+} for s in scrobbles()]).set_index('dt')
+df[200: 205]
 #+end_src
@@ -318,48 +396,59 @@ We can use [[https://github.com/martijnvermaat/calmap][calmap]] library to plot
 plt.figure(figsize=(10, 2.3))
 import calmap
-cdf = cdf.set_index(cdf.index.tz_localize(None)) # calmap expects tz-unaware dates
-calmap.yearplot(cdf['track'], how='count', year=2018)
+df = df.set_index(df.index.tz_localize(None)) # calmap expects tz-unaware dates
+calmap.yearplot(df['track'], how='count', year=2018)
 plt.tight_layout()
 plt.title('My music listening activity for 2018')
-plot_file = 'lastfm_2018.png'
+plot_file = 'hpi_files/lastfm_2018.png'
 plt.savefig(plot_file)
 plot_file
 #+end_src
-[[https://beepb00p.xyz/lastfm_2018.png]]
+[[https://beepb00p.xyz/hpi_files/lastfm_2018.png]]
 This isn't necessarily very insightful data, but fun to look at now and then!
 ** What are the most interesting Slate Star Codex posts I've read?
+:PROPERTIES:
+:CUSTOM_ID: hypothesis_stats
+:END:
 My friend asked me if I could recommend them posts I found interesting on [[https://slatestarcodex.com][Slate Star Codex]].
 With few lines of Python I can quickly recommend them posts I engaged most with, i.e. the ones I annotated most on [[https://hypothes.is][Hypothesis]].
 #+begin_src python
-from my.hypothesis import get_pages
+from my.hypothesis import pages
 from collections import Counter
-cc = Counter({p.url: len(p.highlights) for p in get_pages() if 'slatestarcodex' in p.url})
+cc = Counter({(p.title + ' ' + p.url): len(p.highlights) for p in pages() if 'slatestarcodex' in p.url})
 return cc.most_common(10)
 #+end_src
-| http://slatestarcodex.com/2013/10/20/the-anti-reactionary-faq/ | 32 |
-| https://slatestarcodex.com/2013/03/03/reactionary-philosophy-in-an-enormous-planet-sized-nutshell/ | 17 |
-| http://slatestarcodex.com/2014/12/17/the-toxoplasma-of-rage/ | 16 |
-| https://slatestarcodex.com/2014/03/17/what-universal-human-experiences-are-you-missing-without-realizing-it/ | 16 |
-| http://slatestarcodex.com/2014/07/30/meditations-on-moloch/ | 12 |
-| http://slatestarcodex.com/2015/04/21/universal-love-said-the-cactus-person/ | 11 |
-| http://slatestarcodex.com/2015/01/01/untitled/ | 11 |
-| https://slatestarcodex.com/2017/02/09/considerations-on-cost-disease/ | 10 |
-| http://slatestarcodex.com/2013/04/25/in-defense-of-psych-treatment-for-attempted-suicide/ | 9 |
-| https://slatestarcodex.com/2014/09/30/i-can-tolerate-anything-except-the-outgroup/ | 9 |
+| The Anti-Reactionary FAQ http://slatestarcodex.com/2013/10/20/the-anti-reactionary-faq/ | 32 |
+| Reactionary Philosophy In An Enormous, Planet-Sized Nutshell https://slatestarcodex.com/2013/03/03/reactionary-philosophy-in-an-enormous-planet-sized-nutshell/ | 17 |
+| The Toxoplasma Of Rage http://slatestarcodex.com/2014/12/17/the-toxoplasma-of-rage/ | 16 |
+| What Universal Human Experiences Are You Missing Without Realizing It? https://slatestarcodex.com/2014/03/17/what-universal-human-experiences-are-you-missing-without-realizing-it/ | 16 |
+| Meditations On Moloch http://slatestarcodex.com/2014/07/30/meditations-on-moloch/ | 12 |
+| Universal Love, Said The Cactus Person http://slatestarcodex.com/2015/04/21/universal-love-said-the-cactus-person/ | 11 |
+| Untitled http://slatestarcodex.com/2015/01/01/untitled/ | 11 |
+| Considerations On Cost Disease https://slatestarcodex.com/2017/02/09/considerations-on-cost-disease/ | 10 |
+| In Defense of Psych Treatment for Attempted Suicide http://slatestarcodex.com/2013/04/25/in-defense-of-psych-treatment-for-attempted-suicide/ | 9 |
+| I Can Tolerate Anything Except The Outgroup https://slatestarcodex.com/2014/09/30/i-can-tolerate-anything-except-the-outgroup/ | 9 |
 ** Accessing exercise data
+:PROPERTIES:
+:CUSTOM_ID: exercise
+:END:
-E.g. see use of ~my.workouts~ [[https://beepb00p.xyz/./heartbeats_vs_kcals.html][here]].
+E.g. see use of ~my.workouts~ [[https://beepb00p.xyz/heartbeats_vs_kcals.html][here]].
 ** Book reading progress
+:PROPERTIES:
+:CUSTOM_ID: kobo_progress
+:END:
 I publish my reading stats on [[https://www.goodreads.com/user/show/22191391-dima-gerasimov][Goodreads]] so other people can see what I'm reading/have read, but Kobo [[https://beepb00p.xyz/ideas.html#kobo2goodreads][lacks integration]] with Goodreads.
 I'm using [[https://github.com/karlicoss/kobuddy][kobuddy]] to access my Kobo data, and I've got a regular task that reminds me to sync my progress once a month.
@@ -368,7 +457,7 @@ The task looks like this:
 #+begin_src org
 ,* TODO [#C] sync [[https://goodreads.com][reading progress]] with kobo
   DEADLINE: <2019-11-24 Sun .+4w -0d>
-  [[eshell: with_my python3 -c 'import my.books.kobo as kobo; kobo.print_progress()']]
+  [[eshell: python3 -c 'import my.kobo; my.kobo.print_progress()']]
 #+end_src
 With a single Enter keypress on the inlined =eshell:= command I can print the progress and fill in the completed books on Goodreads, e.g.:
@@ -398,6 +487,9 @@ With a single Enter keypress on the inlined =eshell:= command I can print the pr
 #+end_example
 ** Messenger stats
+:PROPERTIES:
+:CUSTOM_ID: messenger_stats
+:END:
 How much do I chat on Facebook Messenger?
 #+begin_src python
@@ -419,19 +511,64 @@ How much do I chat on Facebook Messenger?
 x_labels = df.index.strftime('%Y %b')
 ax.set_xticklabels(x_labels)
-plot_file = 'messenger_2016_to_2019.png'
+plot_file = 'hpi_files/messenger_2016_to_2019.png'
 plt.tight_layout()
 plt.savefig(plot_file)
 return plot_file
 #+end_src
-[[https://beepb00p.xyz/messenger_2016_to_2019.png]]
+[[https://beepb00p.xyz/hpi_files/messenger_2016_to_2019.png]]
** Which month in 2020 did I make the most git commits in?
:PROPERTIES:
:CUSTOM_ID: hpi_query_git
:END:
If you like the shell or just want to quickly convert/grab some information from HPI, it also comes with a JSON query interface - so you can export the data, or just pipeline to your heart's content:
#+begin_src bash
$ hpi query my.coding.commits.commits --stream # stream JSON objects as they're read
--order-type datetime # find the 'datetime' attribute and order by that
--after '2020-01-01' --before '2021-01-01' # in 2020
| jq '.committed_dt' -r # extract the datetime
# mangle the output a bit to group by month and graph it
| cut -d'-' -f-2 | sort | uniq -c | awk '{print $2,$1}' | sort -n | termgraph
#+end_src
#+begin_src
2020-01: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 458.00
2020-02: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 440.00
2020-03: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 545.00
2020-04: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 585.00
2020-05: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 518.00
2020-06: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 755.00
2020-07: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 467.00
2020-08: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 449.00
2020-09: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.03 K
2020-10: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 791.00
2020-11: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 474.00
2020-12: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 383.00
#+end_src
See [[https://github.com/karlicoss/HPI/blob/master/doc/QUERY.md][query docs]]
for more examples
** Querying Roam Research database
:PROPERTIES:
:CUSTOM_ID: roamresearch
:END:
I've got some code examples [[https://beepb00p.xyz/myinfra-roam.html#interactive][here]].
 * How does it get input data?
+:PROPERTIES:
+:CUSTOM_ID: input_data
+:END:
 If you're curious about any specific data sources I'm using, I've written it up [[https://beepb00p.xyz/my-data.html][in detail]].
+Also see [[https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#data-flow]["Data flow"]] documentation with some nice diagrams explaining specific examples.
 In short:
 - The data is [[https://beepb00p.xyz/myinfra.html#exports][periodically synchronized]] from the services (cloud or not) locally, on the filesystem
@@ -452,8 +589,15 @@ I consider it a necessary sacrifice to make everything fast and resilient.
 In theory, it's possible to make the system almost realtime by having a service that sucks in data continuously (rather than periodically), but it's harder as well.
 * Q & A
+:PROPERTIES:
+:CUSTOM_ID: q_and_a
+:END:
 ** Why Python?
+:PROPERTIES:
+:CUSTOM_ID: why_python
+:END:
 I don't consider Python unique as a language suitable for such a project.
 It just happens to be the one I'm most comfortable with.
 I do have some reasons that I think make it /specifically/ good, but explaining them is out of this post's scope.
@@ -466,15 +610,21 @@ I've heard LISPs are great for data? ;)
 Overall, I wish [[https://en.wikipedia.org/wiki/Foreign_function_interface][FFIs]] were a bit more mature, so we didn't have to think about specific programming languages at all.
 ** Can anyone use it?
+:PROPERTIES:
+:CUSTOM_ID: can_anyone_use_it
+:END:
 Yes!
-- you can plug in your own data
+- you can plug in *your own data*
 - most modules are isolated, so you can only use the ones that you want to
-- everything is easily extensible
+- everything is easily *extensible*
 Starting from simply adding new modules to any dynamic hackery you can possibly imagine within Python.
 ** How easy is it to use?
+:PROPERTIES:
+:CUSTOM_ID: how_easy_to_use
+:END:
 The whole setup requires some basic programmer literacy:
 - installing/running and potentially modifying Python code
@@ -484,9 +634,12 @@ The whole setup requires some basic programmer literacy:
 If you have any ideas on making the setup simpler, please let me know!
 ** What about privacy?
+:PROPERTIES:
+:CUSTOM_ID: privacy
+:END:
-The modules contain no data, only code to operate on the data.
+The modules contain *no data, only code* to operate on the data.
-Everything is [[https://beepb00p.xyz/tags.html#offline][local fist]], the input data is on your filesystem.
+Everything is [[https://beepb00p.xyz/tags.html#offline][*local first*]], the input data is on your filesystem.
 If you're truly paranoid, you can even wrap it in a Docker container.
 There is still a question of whether you trust yourself at even keeping all the data on your disk, but it is out of the scope of this post.
@@ -494,6 +647,10 @@ There is still a question of whether you trust yourself at even keeping all the
 If you'd rather keep some code private too, it's also trivial to achieve with a private subpackage.
 ** But /should/ I use it?
+:PROPERTIES:
+:CUSTOM_ID: should_i_use_it
+:END:
 #+begin_quote
 Sure, maybe you can achieve a perfect system where you can instantly find and recall anything that you've done. Do you really want it?
 Wouldn't that, like, make you less human?
@@ -511,10 +668,14 @@ I can clearly delegate some tasks, like long term memory, information lookup, an
 What about these people who have perfect recall and wish they hadn't.
 #+end_quote
-Sure, maybe it sucks. At the moment though, I don't anything close to it and this only annoys me.
+Sure, maybe it sucks. At the moment though, my recall is far from perfect, and this only annoys me.
 I want to have a choice at least, and digital tools give me this choice.
 ** Would it suit /me/?
+:PROPERTIES:
+:CUSTOM_ID: would_it_suit_me
+:END:
 Probably, at least to some extent.
 First, our lives are different, so our APIs might be different too.
@@ -534,7 +695,11 @@ but I still feel that wouldn't be enough.
 I'm not sure whether it's a solvable problem at this point, but happy to hear any suggestions!
 ** What it isn't?
+:PROPERTIES:
+:CUSTOM_ID: what_it_isnt
+:END:
-- It's not vaporwave
+- It's not vaporware
 The project is a little crude, but it's real and working. I've been using it for a long time now, and find it fairly sustainable to keep using for the foreseeable future.
@@ -547,23 +712,51 @@ I'm not sure whether it's a solvable problem at this point, but happy to hear an
 Please take my ideas and code and build something cool from it!
* HPI Repositories
:PROPERTIES:
:CUSTOM_ID: hpi_repos
:END:
One of HPI's core goals is to be as extendable as possible. The goal here isn't to become a monorepo and support every possible data source/website to the point that this isn't maintainable anymore, but hopefully you get a few modules 'for free'.
If you want to write modules for personal use but don't want to merge them into here, you're free to maintain modules locally in a separate directory to avoid any merge conflicts, and entire HPI repositories can even be published separately and installed into the single ~my~ python package (For more info on this, see [[https://github.com/karlicoss/HPI/tree/master/doc/MODULE_DESIGN.org][MODULE_DESIGN]])
Other HPI Repositories:
- [[https://github.com/purarue/HPI][purarue/HPI]]
- [[https://github.com/madelinecameron/hpi][madelinecameron/HPI]]
If you want to create your own repository with modules, or override something here, you can use the [[https://github.com/purarue/HPI-template][template]].
 * Related links
+:PROPERTIES:
+:CUSTOM_ID: links
+:END:
 Similar projects:
+- [[https://hyfen.net/memex][Memex]] by Andrew Louis
 - [[https://github.com/novoid/Memacs][Memacs]] by Karl Voit
 - [[https://news.ycombinator.com/item?id=9615901][Me API - turn yourself into an open API (HN)]]
 - [[https://github.com/markwk/qs_ledger][QS ledger]] from Mark Koester
+- [[https://dogsheep.github.io][Dogsheep]]: a collection of tools for personal analytics using SQLite and Datasette
 - [[https://github.com/tehmantra/my][tehmantra/my]]: directly inspired by this package
-- [[https://github.com/bcongdon/bolero][bcongdon/bolero]]
+- [[https://github.com/bcongdon/bolero][bcongdon/bolero]]: exposes your personal data as a REST API
 - [[https://en.wikipedia.org/wiki/Solid_(web_decentralization_project)#Design][Solid project]]: personal data pod, which websites pull data from
+- [[https://remotestorage.io][remoteStorage]]: open protocol for apps to write data to your own storage
+- [[https://perkeep.org][Perkeep]]: a tool with [[https://perkeep.org/doc/principles][principles]] and esp. [[https://perkeep.org/doc/uses][use cases]] for self-sovereign storage of personal data
+- [[https://www.openhumans.org][Open Humans]]: a community and infrastructure to analyse and share personal data
 Other links:
 - NetOpWibby: [[https://news.ycombinator.com/item?id=21684949][A Personal API (HN)]]
 - [[https://beepb00p.xyz/sad-infra.html][The sad state of personal data and infrastructure]]: here I am going into motivation and difficulties arising in the implementation
+- [[https://beepb00p.xyz/myinfra-roam.html][Extending my personal infrastructure]]: a followup, where I'm demonstrating how to integrate a new data source (Roam Research)
 * --
+:PROPERTIES:
+:CUSTOM_ID: fin
+:END:
 Open to any feedback and thoughts!
 Also, don't hesitate to raise an issue, or reach me personally if you want to try using it, and find the instructions confusing. Your questions would help me to make it simpler!


@@ -1,17 +0,0 @@
https://github.com/crowoy/Health-Analysis
https://github.com/joytafty-work/SleepModel
https://github.com/search?l=Jupyter+Notebook&q=s_awakenings&type=Code&utf8=%E2%9C%93
https://github.com/oshev/colifer/blob/592cc6b4d1ac9005c52fccdfb4e207513812baaa/colifer.py
https://github.com/oshev/colifer/blob/592cc6b4d1ac9005c52fccdfb4e207513812baaa/reportextenders/jawbone/jawbone_sleep.py
https://github.com/GlenCrawford/ruby_jawbone
* https://nyquist212.wordpress.com/2015/06/22/visualizing-jawbone-up-data-with-d3-js/
* TODO ok, so should really do a week of consistent bedtime/waking up to make some final decision on jawbone?
* TODO figure out timezones
* TODO post on reddit? release and ask people to run against their data?
* TODO [2019-12-19 Thu 19:53] hmm, if package isn't using mycfg then we don't really need it?

conftest.py (new file)

@@ -0,0 +1,47 @@
# this is a hack to monkey patch pytest so it handles tests inside namespace packages without __init__.py properly
# without it, pytest can't discover the package root for some reason
# also see https://github.com/karlicoss/pytest_namespace_pkgs for more
import os
import pathlib
from typing import Optional
import _pytest.main
import _pytest.pathlib
# we consider all dirs in repo/ to be namespace packages
root_dir = pathlib.Path(__file__).absolute().parent.resolve() # / 'src'
assert root_dir.exists(), root_dir
# TODO assert it contains package name?? maybe get it via setuptools..
namespace_pkg_dirs = [str(d) for d in root_dir.iterdir() if d.is_dir()]
# resolve_package_path is called from _pytest.pathlib.import_path
# takes a full abs path to the test file and needs to return the path to the 'root' package on the filesystem
resolve_pkg_path_orig = _pytest.pathlib.resolve_package_path
def resolve_package_path(path: pathlib.Path) -> Optional[pathlib.Path]:
    result = path  # search from the test file upwards
    for parent in result.parents:
        if str(parent) in namespace_pkg_dirs:
            return parent
    if os.name == 'nt':
        # ??? for some reason on windows it is trying to call this against conftest? but not on linux/osx
        if path.name == 'conftest.py':
            return resolve_pkg_path_orig(path)
    raise RuntimeError("Couldn't determine path for ", path)
_pytest.pathlib.resolve_package_path = resolve_package_path
# without patching, the orig function returns just a package name for some reason
# (I think it's used as a sort of fallback)
# so we need to point it at the absolute path properly
# not sure what are the consequences.. maybe it wouldn't be able to run against installed packages? not sure..
search_pypath_orig = _pytest.main.search_pypath
def search_pypath(module_name: str) -> str:
    mpath = root_dir / module_name.replace('.', os.sep)
    if not mpath.is_dir():
        mpath = mpath.with_suffix('.py')
        assert mpath.exists(), mpath  # just in case
    return str(mpath)
_pytest.main.search_pypath = search_pypath

demo.py

@@ -1,24 +1,36 @@
 #!/usr/bin/env python3
 from subprocess import check_call, DEVNULL
-from shutil import copy, copytree
+from shutil import copytree, ignore_patterns
 import os
 from os.path import abspath
+from sys import executable as python
 from pathlib import Path
 my_repo = Path(__file__).absolute().parent
-def run():
+def run() -> None:
     # uses fixed paths; worth it for the sake of demonstration
     # assumes we're in /tmp/my_demo now
     # 1. clone git@github.com:karlicoss/my.git
-    copytree(my_repo, 'my_repo', symlinks=True)
+    copytree(
+        my_repo,
+        'my_repo',
+        symlinks=True,
+        ignore=ignore_patterns('.tox*'),  # tox dir might have broken symlinks while tests are running in parallel
+    )
     # 2. prepare repositories you'd be using. For this demo we only set up Hypothesis
-    hypothesis_repo = abspath('hypothesis_repo')
-    check_call(['git', 'clone', 'https://github.com/karlicoss/hypexport.git', hypothesis_repo])
-    #
+    tox = 'TOX' in os.environ
+    if tox:  # tox doesn't like --user flag
+        check_call(f'{python} -m pip install git+https://github.com/karlicoss/hypexport.git'.split())
+    else:
+        try:
+            import hypexport
+        except ModuleNotFoundError:
+            check_call(f'{python} -m pip install --user git+https://github.com/karlicoss/hypexport.git'.split())
     # 3. prepare some demo Hypothesis data
     hypothesis_backups = abspath('backups/hypothesis')
@@ -31,8 +43,8 @@ def run():
     #
     # 4. point my.config to the Hypothesis data
-    mycfg_root = abspath('my_repo/mycfg_template')
-    init_file = Path(mycfg_root) / 'my/config/__init__.py'
+    mycfg_root = abspath('my_repo')
+    init_file = Path(mycfg_root) / 'my/config.py'
     init_file.write_text(init_file.read_text().replace(
         '/path/to/hypothesis/data',
         hypothesis_backups,
@@ -42,10 +54,10 @@ def run():
     # 4. now we can use it!
     os.chdir(my_repo)
-    check_call(['python3', '-c', '''
+    check_call([python, '-c', '''
 import my.hypothesis
-pages = my.hypothesis.get_pages()
+pages = my.hypothesis.pages()
 from itertools import islice
 for page in islice(pages, 0, 8):
@@ -100,13 +112,17 @@ def named_temp_dir(name: str):
     """
     Fixed name tmp dir
     """
-    td = (Path('/tmp') / name)
+    import tempfile
+    td = Path(tempfile.gettempdir()) / name
     try:
         td.mkdir(exist_ok=False)
         yield td
     finally:
-        import shutil
-        shutil.rmtree(str(td))
+        import os, shutil
+        skip_cleanup = 'CI' in os.environ and os.name == 'nt'
+        # TODO hmm for some reason cleanup on windows causes AccessError
+        if not skip_cleanup:
+            shutil.rmtree(str(td))
 def main():

doc/CONFIGURING.org (new file)

@@ -0,0 +1,301 @@
This doc describes the technical decisions behind HPI configuration system.
It's more of a 'design doc' rather than usage guide.
If you just want to know how to set up HPI or configure it, see [[file:SETUP.org][SETUP]].
I feel like it's good to keep the rationales in the documentation,
but happy to [[https://github.com/karlicoss/HPI/issues/46][discuss]] it here.
Before discussing the abstract matters, let's consider a specific situation.
Say, we want to let the user configure [[https://github.com/karlicoss/HPI/blob/master/my/bluemaestro/__init__.py][bluemaestro]] module.
At the moment, it uses the following config attributes:
- ~export_path~
Path to the data, this is obviously a *required* attribute
- ~cache_path~
Cache is extremely useful to speed up some queries. But it's *optional*, everything should work without it.
I'll refer to this config as *specific* further in the doc, and give examples for each point. Note that they are only illustrating the specific requirement, potentially ignoring the other ones.
Now, the requirements as I see it:
1. configuration should be *extremely* flexible
We need to make sure it's very easy to combine/filter/extend data without having to turn the module code inside out.
This means using a powerful language for the config, and realistically, a Turing complete one.
General: that means that you should be able to use powerful syntax, potentially running arbitrary code if
this is something you need (for whatever mad reason). It should be possible to override config attributes *in runtime*, if necessary, without rewriting files on the filesystem.
Specific: we've got Python already, so it makes a lot of sense to use it!
#+begin_src python
class bluemaestro:
export_path = '/path/to/bluemaestro/data'
cache_path = '/tmp/bluemaestro.cache'
#+end_src
Downsides:
- keeping it overly flexible and powerful means it's potentially less accessible to people less familiar with programming
But see the further point about keeping it simple. I claim that simple programs look as easy as simple JSON.
- Python is 'less safe' than a plain JSON/YAML config
But at the moment the whole thing is running potentially untrusted Python code anyway.
It's not a tool you're going to install across your organization, run under root privileges, and let the employees tweak it.
Ultimately, you set it up for yourself, and the config has exactly the same permissions as the code you're installing.
Thinking that plain config would give you more security is deceptive, and it's a false sense of security (at this stage of the project).
# TODO I don't mind having JSON/TOML/whatever, but only as an additional interface
I also write more about all this [[https://beepb00p.xyz/configs-suck.html][here]].
2. configuration should be *backwards compatible*
General: the whole system is pretty chaotic, it's hard to control the versioning of different modules and their compatibility.
It's important to allow changing attribute names and adding new functionality, while making sure the module works against an older version of the config.
Ideally warn the user that they'd better migrate to a newer version if the fallbacks are triggered.
Potentially: use individual versions for modules? Although it makes things a bit complicated.
Specific: say the module is using a new config attribute, ~timezone~.
We would need to adapt the module to support the old configs without timezone. For example, in ~bluemaestro.py~ (pseudo code):
#+begin_src python
user_config = load_user_config()
if not hasattr(user_config, 'timezone'):
    warnings.warn("Please specify 'timezone' in the config! Falling back to the system timezone.")
    user_config.timezone = get_system_timezone()
#+end_src
This is possible to achieve with pretty much any config format; it's just important to keep in mind.
Downsides: none, hopefully no one argues against backwards compatibility.
3. configuration should be as *easy to write* as possible
General: as lean and non-verbose as possible. No extra imports, no extra inheritance, annotations, etc. Loose coupling.
Specific: the user *only* has to specify ~export_path~ to make the module function and that's it. For example:
#+begin_src js
{
'export_path': '/path/to/bluemaestro/'
}
#+end_src
It's possible to achieve with any configuration format (aided by some helpers to fill in optional attributes etc), so it's more of a guiding principle.
Downsides:
- no (mandatory) annotations means more potential to break, but I'd rather leave this decision to the users
4. configuration should be as *easy to use and extend* as possible
General: enable the users to add new config attributes and *immediately* use them without any hassle and boilerplate.
It's easy to achieve on its own, but harder to achieve simultaneously with (2).
Specific: if you keep the config as Python, simply importing the config in the module satisfies this property:
#+begin_src python
from my.config import bluemaestro as user_config
#+end_src
If the config is in JSON or something, it's possible to load it dynamically too without the boilerplate.
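For illustration, a minimal sketch of such dynamic loading (the file name is made up):
#+begin_src python
import json
import types
from pathlib import Path

raw = json.loads(Path('configs/bluemaestro.json').read_text())
# expose the parsed JSON as attributes, mimicking a config class
user_config = types.SimpleNamespace(**raw)
#+end_src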
Downsides: none, hopefully no one is against extensibility
5. configuration should have checks
General: make sure it's easy to track down configuration errors. At least runtime checks for required attributes, their types, warnings, that sort of thing. But a biggie for me is using *mypy* to statically typecheck the modules.
To some extent it gets in the way of (2) and (4).
Specific: using ~NamedTuple/dataclass~ has capabilities to verify the config with no extra boilerplate on the user side.
#+begin_src python
class bluemaestro(NamedTuple):
    export_path: str
    cache_path : Optional[str] = None

raw_config = json.load(open('configs/bluemaestro.json'))
config = bluemaestro(**raw_config)
#+end_src
This will fail if the required =export_path= is missing, and will fill the optional =cache_path= with ~None~. In addition, it's ~mypy~ friendly.
Downsides: none, especially if it's possible to turn checks on/off.
6. configuration should be easy to document
General: ideally, it should be autogenerated, be self-descriptive and have some sort of schema, to make sure the documentation (which no one likes to write) doesn't diverge.
Specific: mypy annotations seem like the way to go. See the example from (5); it's pretty clear from the code what needs to be in the config.
Downsides: none, self-documented code is good.
* Solution?
Now I'll consider potential solutions to the configuration, taking the different requirements into account.
Like I already mentioned, plain configs (JSON/YAML/TOML) are very inflexible and go against (1), which in my opinion makes them a no-go.
So: my suggestion is to write the *configs as Python code*.
It's hard to satisfy all requirements *at the same time*, but I want to argue that it's possible to satisfy most of them, depending on the maturity of the module we're configuring.
Let's say you want to write a new module. You start with a config like:
#+begin_src python
class bluemaestro:
    export_path = '/path/to/bluemaestro/data'
    cache_path  = '/tmp/bluemaestro.cache'
#+end_src
And to use it:
#+begin_src python
from my.config import bluemaestro as user_config
#+end_src
Let's go through requirements:
- (1): *yes*, simply importing Python code is the most flexible you can get
In addition, at runtime, you can simply assign a new config if you need some dynamic hacking:
#+begin_src python
class new_config:
    export_path = '/some/hacky/dynamic/path'

my.config.bluemaestro = new_config
#+end_src
After that, =my.bluemaestro= would run against your new config.
- (2): *no*, but backwards compatibility is not necessary in the first version of the module
- (3): *mostly*, although optional fields require extra work
- (4): *yes*, whatever is in the config can immediately be used by the code
- (5): *mostly*, imports are transparent to ~mypy~, although runtime type checks would be nice too
- (6): *no*, you have to guess the config from the usage.
This approach is extremely simple, and already *good enough for initial prototyping* or *private modules*.
The main downside so far is the lack of documentation (6), which I'll try to solve next.
I see mypy annotations as the only sane way to support it, because we also get (5) for free. So we could use:
- potentially [[https://github.com/karlicoss/HPI/issues/12#issuecomment-610038961][file-config]]
However, it's using plain files and doesn't satisfy (1).
Also not sure about (5): =file-config= allows using mypy annotations, but I'm not convinced they would be correctly typechecked by mypy; I think you'd need a plugin for that.
- [[https://mypy.readthedocs.io/en/stable/protocols.html#simple-user-defined-protocols][Protocol]]
I experimented with ~Protocol~ [[https://github.com/karlicoss/HPI/pull/45/commits/90b9d1d9c15abe3944913add5eaa5785cc3bffbc][here]].
It's pretty cool, very flexible, and doesn't impose any runtime modifications, which makes it good for (4).
The downsides are:
- it doesn't support optional attributes (optional as in non-required, not as ~typing.Optional~), so it goes against (3)
prior to python 3.8, it's a part of =typing_extensions= rather than the standard =typing=, so using it requires guarding the code with =if typing.TYPE_CHECKING=, which is a bit confusing and adds bloat.
TODO: check out [[https://mypy.readthedocs.io/en/stable/protocols.html#using-isinstance-with-protocols][@runtime_checkable]]?
- =NamedTuple=
[[https://github.com/karlicoss/HPI/pull/45/commits/c877104b90c9d168eaec96e0e770e59048ce4465][Here]] I experimented with using ~NamedTuple~.
Similarly to Protocol, it's self-descriptive, and in addition allows for non-required fields.
# TODO something about helper methods? can't use them with Protocol
Downsides:
it goes against (4), because a NamedTuple (being a =tuple= at runtime) can only contain the attributes declared in the schema.
- =dataclass=
Similar to =NamedTuple=, but it's possible to add extra attributes to a =dataclass= with ~setattr~ to implement (4).
Downsides:
we partially lose (5), because dynamic attributes are not transparent to mypy.
My conclusion was using a *combined approach*:
- Use =@dataclass= base for documentation and default attributes, achieving (6) and (3)
- Inherit the original config class to bring in the extra attributes, achieving (4)
Inheritance is a standard mechanism, which doesn't require any extra frameworks and plays well with other Python concepts. As a specific example:
#+begin_src python
from my.config import bluemaestro as user_config
@dataclass
class bluemaestro(user_config):
    '''
    The header of this file contributes towards the documentation
    '''
    export_path: str
    cache_path : Optional[str] = None

    @classmethod
    def make_config(cls) -> 'bluemaestro':
        params = {
            k: v
            for k, v in vars(cls.__base__).items()
            if k in {f.name for f in dataclasses.fields(cls)}
        }
        return cls(**params)

config = bluemaestro.make_config()
#+end_src
I claim this solves pretty much everything:
- *(1)*: yes, the config attributes are preserved and can be anything that's allowed in Python
- *(2)*: collaterally, we also solved it, because renames and other legacy config adaptations can be handled in ~make_config~
- *(3)*: supports default attributes, at no extra cost
- *(4)*: the user config's attributes are available through the base class
- *(5)*: everything is mostly transparent to mypy. There are no runtime type checks yet, but I think it's possible to integrate them with ~@dataclass~
- *(6)*: the dataclass header is easily readable, and it's possible to generate the docs automatically
Downsides:
- inheriting from ~user_config~ means an early import of =my.config=
Generally it's better to keep everything as lazy as possible and defer loading to the first time the config is used.
This might be annoying at times, e.g. if you have a top-level import of your module, but no config.
But considering that in 99% of cases config is going to be on the disk
and it's [[https://github.com/karlicoss/HPI/blob/1e6e0bd381d20437343473878c7f63b1f9d6362b/tests/demo.py#L22-L25][possible]] to do something dynamic like =del sys.modules['my.bluemaestro']= to reload the config, I think it's a minor issue.
- =make_config= allows for some mypy false negatives in the user config
E.g. if you forgot the =export_path= attribute, mypy would miss it. But you'd have a runtime failure, and the downstream code using the config is still correctly type checked.
Perhaps it will be better when [[https://github.com/python/mypy/issues/5374][this mypy issue]] is fixed.
- the =make_config= bit is a little scary and manual
However, it's extracted in a generic helper, and [[https://github.com/karlicoss/HPI/blob/d6f071e3b12ba1cd5a86ad80e3821bec004e6a6d/my/twitter/archive.py#L17][ends up pretty simple]] (see the sketch after this list)
# In addition, it's not even necessary if you don't have optional attributes, you can simply use the class variables (i.e. ~bluemaestro.export_path~)
# upd. ugh, you can't, it doesn't handle default attributes overriding correctly (see tests/demo.py)
# eh. basically all I need is class level dataclass??
- inheriting from ~user_config~ requires it to be a =class= rather than an =object=
A practical downside is you can't use something like ~SimpleNamespace~.
But considering you can define an ad-hoc =class= anywhere, this is fine?
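To make the 'generic helper' mentioned above concrete, here's a minimal sketch of what it could look like (illustrative only; the actual helper in =my.core= may differ). The ~migration~ hook also shows where legacy renames for (2) could be handled:
#+begin_src python
import dataclasses
from typing import Any, Callable, Dict, Type, TypeVar

C = TypeVar('C')

def make_config(cls: Type[C], migration: Callable[[Dict[str, Any]], Dict[str, Any]] = lambda d: d) -> C:
    # attributes the user set on the base (user config) class
    user_attrs = {k: v for k, v in vars(cls.__base__).items() if not k.startswith('__')}
    # adapt renamed/legacy attributes here, supporting (2)
    user_attrs = migration(user_attrs)
    declared = {f.name for f in dataclasses.fields(cls)}
    # only pass declared attributes; dataclass defaults fill in the rest
    return cls(**{k: v for k, v in user_attrs.items() if k in declared})

config = make_config(bluemaestro)
#+end_src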
My conclusion is that I'm going with this approach for now.
Note that at no stage did this require any changes to the user configs, so if I missed something, it would be reversible.
* Side modules :noexport:
Some of TODO rexport?
To some extent, this is an experiment. I'm not sure how much value is in .
One thing are TODO software? libraries that have fairly well defined APIs and you can reasonably version them.
Another thing is the modules for accessing data, where you'd hopefully have everything backwards compatible.
Maybe in the future
I'm just not sure, happy to hear people's opinions on this.

doc/CONTRIBUTING.org (new file)
@@ -0,0 +1,12 @@
doc in progress
- I don't use automatic code formatters (like =black=)
I don't mind if you do, e.g. when you're adding new code or formatting some code you modified, but please don't reformat the whole repository or slip in unrelated code style changes.
In particular I can't stand when formatters mess with vertically aligned code (thus making it less readable!), or conform the code to some arbitrary line length (like 80 symbols).
Of course reasonable formatting improvements (like obvious typos, missing spaces or too dense code) are welcome.
And of course, if we end up collaborating a lot on the project I'm open to discussion if automatic code style is really important to you.
- See [[file:MODULE_DESIGN.org][MODULE_DESIGN.org]] for common practices in HPI

doc/DENYLIST.md (new file)
@@ -0,0 +1,130 @@
For code reference, see: [`my.core.denylist.py`](../my/core/denylist.py)
A helper module for defining denylists for sources programmatically (in layman's terms, this lets you remove particular outputs you don't want from a module)
Lets you specify a class, an attribute to match on,
and a JSON file containing a list of values to deny/filter out
As an example, this will use the `my.ip` module, as filtering incorrect IPs was the original use case for this module:
```python
class IP(NamedTuple):
    addr: str
    dt: datetime
```
A possible denylist file would contain:
```json
[
    {
        "addr": "192.168.1.1"
    },
    {
        "dt": "2020-06-02T03:12:00+00:00"
    }
]
```
Note that if the value being compared to is not a single (non-array/object) JSON primitive
(str, int, float, bool, None), it will be converted to a string before comparison
To use this in code:
```python
from my.ip.all import ips
from my.core.denylist import DenyList

filtered = DenyList("~/data/ip_denylist.json").filter(ips())
```
To add items to the denylist, in python (in a one-off script):
```python
from my.ip.all import ips
from my.core.denylist import DenyList

d = DenyList("~/data/ip_denylist.json")

for ip in ips():
    # some custom code you define
    if ip.addr == ...:
        d.deny(key="addr", value=ip.addr)
d.write()
```
... or interactively, which requires [`fzf`](https://github.com/junegunn/fzf) and [`pyfzf-iter`](https://pypi.org/project/pyfzf-iter/) (`python3 -m pip install pyfzf-iter`) to be installed:
```python
from my.ip.all import ips
from my.core.denylist import DenyList
d = DenyList("~/data/ip_denylist.json")
d.deny_cli(ips()) # automatically writes after each selection
```
That will open up an interactive `fzf` prompt, where you can select an item to add to the denylist
This is meant for relatively simple filters, where you want to filter items out
based on a single attribute of a namedtuple/dataclass. If you want to do something
more complex, I would recommend overriding the `all.py` file for that source and
writing your own filter function there.
For more info on all.py:
https://github.com/karlicoss/HPI/blob/master/doc/MODULE_DESIGN.org#allpy
This would typically be used in an overridden `all.py` file, or in a one-off script
in which you want to filter out some items from a source, progressively adding more
items to the denylist as you go.
A potential `my/ip/all.py` file might look like (Sidenote: `discord` module from [here](https://github.com/purarue/HPI)):
```python
from typing import Iterator

from my.ip.common import IP
from my.core.denylist import DenyList

deny = DenyList("~/data/ip_denylist.json")

# all possible data from the source
def _ips() -> Iterator[IP]:
    from my.ip import discord
    # could add other imports here
    yield from discord.ips()

# filtered data
def ips() -> Iterator[IP]:
    yield from deny.filter(_ips())
```
To add items to the denylist, you could create a `__main__.py` in your namespace package (in this case, `my/ip/__main__.py`), with contents like:
```python
from my.ip import all

if __name__ == "__main__":
    all.deny.deny_cli(all.ips())
```
Which could then be called like: `python3 -m my.ip`
Or, you could just run it from the command line:
```
python3 -c 'from my.ip import all; all.deny.deny_cli(all.ips())'
```
To edit the `all.py`, you could either:
- install it as editable (`python3 -m pip install --user -e ./HPI`), and then edit the file directly
- or, create a namespace package, which splits the package across multiple directories. For info on that see [`MODULE_DESIGN`](https://github.com/karlicoss/HPI/blob/master/doc/MODULE_DESIGN.org#namespace-packages), [`reorder_editable`](https://github.com/purarue/reorder_editable), and possibly the [`HPI-template`](https://github.com/purarue/HPI-template) to create your own HPI namespace package with its own `all.py` file.
For a real example of this, see [purarue/HPI-personal](https://github.com/purarue/HPI-personal/blob/master/my/ip/all.py)
Sidenote: the reason we want to specifically override
`all.py`, rather than just create a script that filters out the items you're
not interested in, is that we want to be able to import from `my.ip.all`
or `my.location.all` from other modules and get the filtered results, without
having to mix data filtering logic with parsing/loading/caching (the stuff HPI does)

doc/DESIGN.org (new file)
@@ -0,0 +1,55 @@
note: this doc is in progress
* main design principles
- interoperable
# note: this link doesn't work in org, but does for the github preview
This is the main motivation and [[file:../README.org#why][why]] I created HPI in the first place.
Ideally it should be possible to hook into anything you can imagine -- regardless of the database/programming language/etc.
Check out [[https://beepb00p.xyz/myinfra.html#mypkg][my infrastructure map]] to see how I'm using it.
- extensible
It should be possible for anyone to modify/extend HPI to their own needs, e.g.
- adding new data providers
- patching existing ones
- mixing in custom data sources
See the guide to [[file:SETUP.org::#addingmodifying-modules][extending/modifying HPI]]
- local first/offline
The main idea is to work against data on your disk to provide convenient, fast and robust access.
See [[file:../README.org::#how-does-it-get-input-data]["How does it get input data?"]]
Although in principle there is nothing wrong if you want to hook it to some online API, it's just python code after all!
- reasonably defensive
Data is inherently messy, and you'll inevitably get parsing errors and missing fields now and then.
I'm trying to combat this with [[https://beepb00p.xyz/mypy-error-handling.html][mypy assisted error handling]] (see the sketch after this list),
so you are aware of errors, but can still work with the 'good' subset of the data.
- robust
The code is extensively covered with tests & ~mypy~ to make sure it doesn't rot.
I also try to keep everything as backwards compatible as possible.
- (almost) no magic
While I do use Python's dynamic features where it's inevitable or too convenient, I try to keep everything as close to standard Python as possible.
This allows it to:
- be at least as extensible as other Python software
- use mature tools like =pip= or =mypy=
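To illustrate the defensive pattern mentioned above, a minimal sketch (the data shape and parsing are made up; ~my.core~ provides a similar ~Res~ alias):
#+begin_src python
import json
from typing import Iterator, Union

Res = Union[dict, Exception]  # result-or-error

def items(raw_lines: list) -> Iterator[Res]:
    for line in raw_lines:
        try:
            yield json.loads(line)  # parsing messy data may fail
        except Exception as e:
            # yield the error instead of raising, so downstream consumers
            # still receive the 'good' subset of the data
            yield e
#+end_src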
* other docs
- [[file:CONFIGURING.org][some decisions around HPI configuration 'system']]
- [[file:MODULE_DESIGN.org][some thoughts on the modules, their design, and adding new ones]]

@@ -1,13 +1,36 @@
-* IDE setup: make sure my.config is in your package search path
-In runtime, ~my.config~ is imported from the user config directory dynamically.
-However, Pycharm/Emacs/whatever you use won't be able to figure that out, so you'd need to adjust your IDE configuration.
-- Pycharm: basically, follow the instruction [[https://stackoverflow.com/a/55278260/706389][here]]
+* TOC
+:PROPERTIES:
+:TOC: :include all :depth 3
+:END:
+:CONTENTS:
+- [[#toc][TOC]]
+- [[#running-tests][Running tests]]
+- [[#ide-setup][IDE setup]]
+- [[#linting][Linting]]
+:END:
+* Running tests
+I'm using =tox= to run test/lint. You can check out [[file:../.github/workflows/main.yml][Github Actions]] config
+and [[file:../scripts/ci/run]] for the up to date info on the specifics.
+* IDE setup
+To benefit from type hinting, make sure =my.config= is in your package search path.
+In runtime, ~my.config~ is imported from the user config directory [[file:../my/core/init.py][dynamically]].
+However, Pycharm/Emacs or whatever IDE you are using won't be able to figure that out, so you'd need to adjust your IDE configuration.
+- Pycharm: basically, follow the instructions [[https://stackoverflow.com/a/55278260/706389][here]]
   i.e. create a new interpreter configuration (e.g. name it "Python 3.7 (for HPI)"), and add =~/.config/my=.
 * Linting
-You should be able to use ~./lint~ script to run mypy checks.
-~mypy.ini~ file points at =~/.config/my= by default.
+~tox~ should run all tests, mypy, etc.
+If you want to run some specific parts/tests, consult [[file:tox.ini]].
+Some useful flags (look them up):
+- ~-e~ flag for tox
+- ~-k~ flag for pytest

doc/MODULES.org (new file)
@@ -0,0 +1,397 @@
This file is an overview of *documented* modules (which I'm progressively expanding).
There are many more, see:
- [[file:../README.org::#whats-inside]["What's inside"]] for the full list of modules.
- you can also run =hpi modules= to list what's available on your system
- [[https://github.com/karlicoss/HPI][source code]] is always the primary source of truth
If you have some issues with the setup, see [[file:SETUP.org::#troubleshooting]["Troubleshooting"]].
* TOC
:PROPERTIES:
:TOC: :include all
:END:
:CONTENTS:
- [[#toc][TOC]]
- [[#intro][Intro]]
- [[#configs][Configs]]
- [[#mygoogletakeoutparser][my.google.takeout.parser]]
- [[#myhypothesis][my.hypothesis]]
- [[#myreddit][my.reddit]]
- [[#mybrowser][my.browser]]
- [[#mylocation][my.location]]
- [[#mytimetzvia_location][my.time.tz.via_location]]
- [[#mypocket][my.pocket]]
- [[#mytwittertwint][my.twitter.twint]]
- [[#mytwitterarchive][my.twitter.archive]]
- [[#mylastfm][my.lastfm]]
- [[#mypolar][my.polar]]
- [[#myinstapaper][my.instapaper]]
- [[#mygithubgdpr][my.github.gdpr]]
- [[#mygithubghexport][my.github.ghexport]]
- [[#mykobo][my.kobo]]
:END:
* Intro
See [[file:SETUP.org][SETUP]] to find out how to set up your own config.
Some explanations:
- =MY_CONFIG= is the path where you are keeping your private configuration (usually =~/.config/my/=)
- [[https://docs.python.org/3/library/pathlib.html#pathlib.Path][Path]] is a standard Python object to represent paths
- [[https://github.com/karlicoss/HPI/blob/5f4acfddeeeba18237e8b039c8f62bcaa62a4ac2/my/core/common.py#L9][PathIsh]] is a helper type to allow using either =str=, or a =Path=
- [[https://github.com/karlicoss/HPI/blob/5f4acfddeeeba18237e8b039c8f62bcaa62a4ac2/my/core/common.py#L108][Paths]] is another helper type for paths.
It's 'smart' and allows you to be flexible about your config (see the example right after this list):
- simple =str= or a =Path=
- =/a/path/to/directory/=, so the module will consume all files from this directory
- a list of files/directories (it will be flattened)
- a [[https://docs.python.org/3/library/glob.html?highlight=glob#glob.glob][glob]] string, so you can be flexible about the format of your data on disk (e.g. if you want to keep it compressed)
- empty string (e.g. ~export_path = ''~), this will prevent the module from consuming any data
This can be useful for modules that merge multiple data sources (for example, =my.twitter= or =my.github=)
Typically, such a variable will be passed to =get_files= to actually extract the list of real files to use. You can see usage examples [[https://github.com/karlicoss/HPI/blob/master/tests/get_files.py][here]].
- if the field has a default value, you can omit it from your private config altogether
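For illustration, all of the following would be valid ways to set such an attribute (paths are made up):
#+begin_src python
class hypothesis:
    export_path = '/backups/hypothesis/'  # consume all files in the directory

    # other valid forms:
    # export_path = '/backups/hypothesis/*.json.xz'          # glob
    # export_path = ['/backups/a.json', '/backups/b.json']   # list, will be flattened
    # export_path = ''                                       # don't consume any data
#+end_src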
For more thoughts on modules and their structure, see [[file:MODULE_DESIGN.org][MODULE_DESIGN]]
* all.py
Some modules have lots of different sources for data. For example,
~my.location~ (location data) has lots of possible sources -- from
~my.google.takeout.parser~, using the ~gpslogger~ android app, or through
geolocating ~my.ip~ addresses. If you only plan on using one of the modules, you
can just import from the individual module (e.g. ~my.google.takeout.parser~)
or you can disable the others using the ~core~ config -- See the
[[https://github.com/karlicoss/HPI/blob/master/doc/MODULE_DESIGN.org#allpy][MODULE_DESIGN]] docs for more details.
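For instance (illustrative; the =locations= functions follow the naming convention used in [[file:MODULE_DESIGN.org][MODULE_DESIGN]]):
#+begin_src python
# merged data from all configured sources:
from my.location.all import locations

# or, pin a single source explicitly:
from my.location.google_takeout import locations
#+end_src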
* Configs
The config snippets below are meant to be modified accordingly and *pasted into your private configuration*, e.g =$MY_CONFIG/my/config.py=.
You don't have to set up all modules at once; it's recommended to do it gradually, to get a feel for how HPI works.
For an extensive/complex example, you can check out ~@purarue~'s [[https://github.com/purarue/dotfiles/blob/master/.config/my/my/config/__init__.py][config]]
# Nested Configurations before the doc generation using the block below
** [[file:../my/reddit][my.reddit]]
Reddit data: saved items/comments/upvotes/etc.
# Note: can't be generated as easily since this is a nested configuration object
#+begin_src python
class reddit:
    class rexport:
        '''
        Uses [[https://github.com/karlicoss/rexport][rexport]] output.
        '''
        # path[s]/glob to the exported JSON data
        export_path: Paths

    class pushshift:
        '''
        Uses [[https://github.com/purarue/pushshift_comment_export][pushshift]] to get access to old comments
        '''
        # path[s]/glob to the exported JSON data
        export_path: Paths
#+end_src
** [[file:../my/browser/][my.browser]]
Parses browser history using [[http://github.com/purarue/browserexport][browserexport]]
#+begin_src python
class browser:
    class export:
        # path[s]/glob to your backed up browser history sqlite files
        export_path: Paths

    class active_browser:
        # paths to sqlite database files which you use actively
        # to read from. For example:
        # from browserexport.browsers.all import Firefox
        # export_path = Firefox.locate_database()
        export_path: Paths
#+end_src
** [[file:../my/location][my.location]]
Merged location history from lots of sources.
The main sources here are
[[https://github.com/mendhak/gpslogger][gpslogger]] .gpx (XML) files, and
google takeout (using =my.google.takeout.parser=), with a fallback on
manually defined home locations.
You might also be able to use [[file:../my/location/via_ip.py][my.location.via_ip]] which uses =my.ip.all= to
provide geolocation data for IPs (though no IPs are provided from any
of the sources here). For an example of usage, see [[https://github.com/purarue/HPI/tree/master/my/ip][here]]
#+begin_src python
class location:
    home = (
        # supports ISO strings
        ('2005-12-04'                                       , (42.697842, 23.325973)), # Bulgaria, Sofia
        # supports date/datetime objects
        (date(year=1980, month=2, day=15)                   , (40.7128  , -74.0060 )), # NY
        (datetime.fromtimestamp(1600000000, tz=timezone.utc), (55.7558  , 37.6173  )), # Moscow, Russia
    )
    # note: order doesn't matter, will be sorted in the data provider

    class gpslogger:
        # path[s]/glob to the exported gpx files
        export_path: Paths
        # default accuracy for gpslogger
        accuracy: float = 50.0

    class via_ip:
        # guess ~15km accuracy for IP addresses
        accuracy: float = 15_000
#+end_src
** [[file:../my/time/tz/via_location.py][my.time.tz.via_location]]
Uses the =my.location= module to determine the timezone for a location.
This can be used to 'localize' timezones. Most modules here return
datetimes in UTC, to prevent confusion about whether a datetime is in
local time, UTC, or some other timezone.
Depending on the specific data provider and your level of paranoia you might expect different behaviour. E.g.:
- if your objects already have tz info, you might not need to call localize() at all
- it's safer when either all of your objects are tz aware or all are tz unaware, not a mixture
- you might trust your original timezone, or it might just be UTC, and you want to use something more reasonable
#+begin_src python
TzPolicy = Literal[
    'keep'   , # if datetime is tz aware, just preserve it
    'convert', # if datetime is tz aware, convert to provider's tz
    'throw'  , # if datetime is tz aware, throw exception
]
#+end_src
This is still a work in progress, plan is to integrate it with =hpi query=
so that you can easily convert/localize timezones for some module/data
#+begin_src python
class time:
    class tz:
        policy = 'keep'

        class via_location:
            # less precise, but faster
            fast: bool = True
            # sort locations by date
            # in case multiple sources provide them out of order
            sort_locations: bool = True
            # if the accuracy for the location is more than 5km (i.e. it's
            # not an accurate location, so shouldn't be used to determine
            # the timezone), don't use it
            require_accuracy: float = 5_000
#+end_src
# TODO hmm. drawer raw means it can output outlines, but then have to manually erase the generated results. ugh.
#+begin_src python :dir .. :results output drawer raw :exports result
# TODO ugh, pkgutil.walk_packages doesn't recurse and find packages like my.twitter.archive??
# yep.. https://stackoverflow.com/q/41203765/706389
import importlib
# from lint import all_modules # meh
# TODO figure out how to discover configs automatically...
modules = [
    ('google'         , 'my.google.takeout.parser'),
    ('hypothesis'     , 'my.hypothesis'           ),
    ('pocket'         , 'my.pocket'               ),
    ('twint'          , 'my.twitter.twint'        ),
    ('twitter_archive', 'my.twitter.archive'      ),
    ('lastfm'         , 'my.lastfm'               ),
    ('polar'          , 'my.polar'                ),
    ('instapaper'     , 'my.instapaper'           ),
    ('github'         , 'my.github.gdpr'          ),
    ('github'         , 'my.github.ghexport'      ),
    ('kobo'           , 'my.kobo'                 ),
]

def indent(s, spaces=4):
    return ''.join(' ' * spaces + l for l in s.splitlines(keepends=True))

from pathlib import Path
import inspect
from dataclasses import fields
import re

print('\n') # ugh. hack for org-ruby drawers bug
for cls, p in modules:
    m = importlib.import_module(p)
    C = getattr(m, cls)
    src = inspect.getsource(C)
    i = src.find('@property')
    if i != -1:
        src = src[:i]
    src = src.strip()
    src = re.sub(r'(class \w+)\(.*', r'\1:', src)
    mpath = p.replace('.', '/')
    for x in ['.py', '__init__.py']:
        if Path(mpath + x).exists():
            mpath = mpath + x
    print(f'** [[file:../{mpath}][{p}]]')
    mdoc = m.__doc__
    if mdoc is not None:
        print(indent(mdoc))
    print(f' #+begin_src python')
    print(indent(src))
    print(f' #+end_src')
#+end_src
#+RESULTS:
** [[file:../my/google/takeout/parser.py][my.google.takeout.parser]]
Parses Google Takeout using [[https://github.com/purarue/google_takeout_parser][google_takeout_parser]]
See [[https://github.com/purarue/google_takeout_parser][google_takeout_parser]] for more information about how to export and organize your takeouts
If the =DISABLE_TAKEOUT_CACHE= environment variable is set, this won't
cache individual exports in =~/.cache/google_takeout_parser=
The directory set as takeout_path can be unpacked directories, or
zip files of the exports, which are temporarily unpacked while creating
the cachew cache
#+begin_src python
class google(user_config):
    # directory which includes unpacked/zipped takeouts
    takeout_path: Paths

    error_policy: ErrorPolicy = 'yield'

    # experimental flag to use core.kompress.ZipPath
    # instead of unpacking to a tmp dir via match_structure
    _use_zippath: bool = False
#+end_src
** [[file:../my/hypothesis.py][my.hypothesis]]
[[https://hypothes.is][Hypothes.is]] highlights and annotations
#+begin_src python
class hypothesis:
    '''
    Uses [[https://github.com/karlicoss/hypexport][hypexport]] outputs
    '''
    # path[s]/glob to the exported JSON data
    export_path: Paths
#+end_src
** [[file:../my/pocket.py][my.pocket]]
[[https://getpocket.com][Pocket]] bookmarks and highlights
#+begin_src python
class pocket:
    '''
    Uses [[https://github.com/karlicoss/pockexport][pockexport]] outputs
    '''
    # path[s]/glob to the exported JSON data
    export_path: Paths
#+end_src
** [[file:../my/twitter/twint.py][my.twitter.twint]]
Twitter data (tweets and favorites).
Uses [[https://github.com/twintproject/twint][Twint]] data export.
Requirements: =pip3 install --user dataset=
#+begin_src python
class twint:
    export_path: Paths # path[s]/glob to the twint Sqlite database
#+end_src
** [[file:../my/twitter/archive.py][my.twitter.archive]]
Twitter data (uses [[https://help.twitter.com/en/managing-your-account/how-to-download-your-twitter-archive][official twitter archive export]])
#+begin_src python
class twitter_archive:
    export_path: Paths # path[s]/glob to the twitter archive takeout
#+end_src
** [[file:../my/lastfm][my.lastfm]]
Last.fm scrobbles
#+begin_src python
class lastfm:
    """
    Uses [[https://github.com/karlicoss/lastfm-backup][lastfm-backup]] outputs
    """
    export_path: Paths
#+end_src
** [[file:../my/polar.py][my.polar]]
[[https://github.com/burtonator/polar-bookshelf][Polar]] articles and highlights
#+begin_src python
class polar:
    '''
    Polar config is optional, you only need it if you want to specify custom 'polar_dir'
    '''
    polar_dir: PathIsh = Path('~/.polar').expanduser()
    defensive: bool = True # pass False if you want it to fail faster on errors (useful for debugging)
#+end_src
** [[file:../my/instapaper.py][my.instapaper]]
[[https://www.instapaper.com][Instapaper]] bookmarks, highlights and annotations
#+begin_src python
class instapaper:
    '''
    Uses [[https://github.com/karlicoss/instapexport][instapexport]] outputs.
    '''
    # path[s]/glob to the exported JSON data
    export_path : Paths
#+end_src
** [[file:../my/github/gdpr.py][my.github.gdpr]]
Github data (uses [[https://github.com/settings/admin][official GDPR export]])
#+begin_src python
class github:
    gdpr_dir: PathIsh # path to unpacked GDPR archive
#+end_src
** [[file:../my/github/ghexport.py][my.github.ghexport]]
Github data: events, comments, etc. (API data)
#+begin_src python
class github:
    '''
    Uses [[https://github.com/karlicoss/ghexport][ghexport]] outputs.
    '''
    # path[s]/glob to the exported JSON data
    export_path: Paths

    # path to a cache directory
    # if omitted, will use /tmp
    cache_dir: Optional[PathIsh] = None
#+end_src
** [[file:../my/kobo.py][my.kobo]]
[[https://uk.kobobooks.com/products/kobo-aura-one][Kobo]] e-ink reader: annotations and reading stats
#+begin_src python
class kobo:
    '''
    Uses [[https://github.com/karlicoss/kobuddy#as-a-backup-tool][kobuddy]] outputs.
    '''
    # path[s]/glob to the exported databases
    export_path: Paths
#+end_src

doc/MODULE_DESIGN.org (new file)
@@ -0,0 +1,331 @@
Some thoughts on modules, how to structure them, and adding your own/extending HPI
This is slightly more advanced, and would be useful if you're trying to extend HPI by developing your own modules, or contributing back to HPI
* TOC
:PROPERTIES:
:TOC: :include all :depth 1 :force (nothing) :ignore (this) :local (nothing)
:END:
:CONTENTS:
- [[#allpy][all.py]]
- [[#module-count][module count]]
- [[#single-file-modules][single file modules]]
- [[#adding-new-modules][Adding new modules]]
- [[#an-extendable-module-structure][An Extendable module structure]]
- [[#logging-guidelines][Logging guidelines]]
:END:
* all.py
Some modules have lots of different sources for data. For example, ~my.location~ (location data) has lots of possible sources -- from ~my.google.takeout.parser~, using the ~gpslogger~ android app, or through geolocating ~my.ip~ addresses. For a module with multiple possible sources, it's common to split it into files like:
#+begin_src
my/location
├── all.py -- specifies all possible sources/combines/merges data
├── common.py -- defines shared code, e.g. to merge data from across entries, a shared model (namedtuple/dataclass) or protocol
├── google_takeout.py -- source for data using my.google.takeout.parser
├── gpslogger.py -- source for data using gpslogger
├── home.py -- fallback source
└── via_ip.py -- source using my.ip
#+end_src
It's common for each of those sources to have their own file, like ~my.location.google_takeout~, ~my.location.gpslogger~ and ~my.location.via_ip~, and then they all get merged into a single function in ~my.location.all~, like:
#+begin_src python
from typing import Iterator

from my.core.source import import_source

from .common import Location


def locations() -> Iterator[Location]:
    # can add/comment out sources here to enable/disable them
    yield from _takeout_locations()
    yield from _gpslogger_locations()


@import_source(module_name="my.location.google_takeout")
def _takeout_locations() -> Iterator[Location]:
    from . import google_takeout
    yield from google_takeout.locations()


@import_source(module_name="my.location.gpslogger")
def _gpslogger_locations() -> Iterator[Location]:
    from . import gpslogger
    yield from gpslogger.locations()
#+end_src
If you want to disable a source, you have a few options.
- If you're using a local editable install or just want to quickly troubleshoot, you can just comment out the line in the ~locations~ function
- Since these are decorated behind ~import_source~, they automatically catch import/config errors, so instead of fatally erroring and crashing if you don't have a module setup, it'll warn you and continue to process the other sources. To get rid of the warnings, you can add the module you're not planning on using to your core config, like:
#+begin_src python
class core:
    disabled_modules = (
        "my.location.gpslogger",
        "my.location.via_ip",
    )
#+end_src
... that suppresses the warning message and lets you use ~my.location.all~ without having to change any lines of code
Another benefit is that all the custom sources/data is localized to the ~all.py~ file, so a user can override the ~all.py~ file (see the sections below on ~namespace packages~) in their own HPI repository, adding additional sources without having to maintain a fork and patching in changes as things eventually change. For a 'real world' example of that, see [[https://github.com/purarue/HPI#partially-in-usewith-overrides][purarue]]'s location and ip modules.
This is of course not required for personal or single file modules, it's just the pattern that seems to have the least amount of friction for the user, while being extendable, and without using a bulky plugin system to let users add additional sources.
Another common way an ~all.py~ file is used is to merge data from a periodic export, and a GDPR export (e.g. see the ~stackexchange~, or ~github~ modules)
* module count
Having way too many modules could end up being an issue. For now, I'm basically happy to merge new modules -- with the current module count, things don't seem to break much, and most of them are modules I use myself, so they get tested with my own data.
For services I don't use, I would prefer if they had tests/example data somewhere, else I can't guarantee they're still working...
It's great if, when you start using HPI, you get a few modules 'for free' (perhaps ~github~ and ~reddit~), but it's likely that not everyone uses the same services
This shouldn't end up becoming a monorepo (a la [[https://www.spacemacs.org/][Spacemacs]]) with hundreds of modules supporting every use case. It's hard to know what the common use case is for everyone, and new services/companies which silo your data appear all the time...
It's also not obvious how people want to access their data. This problem is often mitigated by the output of HPI being python functions -- one can always write a small script to take the output data from a module and wrangle it into some format you want
This is why HPI aims to be as extendable as possible. If you have some programming know-how, hopefully you're able to create some basic modules for yourself - plug in your own data and gain the benefits of using the functions in ~my.core~, the configuration layer and possibly libraries like [[https://github.com/karlicoss/cachew][cachew]] to 'automatically' cache your data
In some ways it may make sense to think of HPI as akin to emacs or one's 'dotfiles'. It provides a configuration layer and structure for you to access your data, and you can extend it to your own use case.
* single file modules
... or, the question 'should we split code from individual HPI files into setuptools packages'
It's possible for a single HPI module or file to handle *everything*. Most of the python files in ~my/~ are 'single file' modules
By everything, I mean:
- Exporting data from an API/locating data on your disk/maybe saving data so you don't lose it
- Parsing data from some raw (JSON/SQLite/HTML) format
- Merging different data sources into some common =NamedTuple=-like schema
- caching expensive computation/merge results
- configuration through ~my.config~
For short modules which aren't that complex, while developing your own personal modules, or while bootstrapping modules - this is actually fine.
From a user's perspective, the ability to clone and install HPI as editable, add a new python file into ~my/~, and have it immediately be accessible as ~my.modulename~ is a pattern that should always be supported
However, as modules get more and more complex, especially if they include backing up/locating data from some location on your filesystem or interacting with a live API -- ideally they should be split off into their own repositories. There are trade-offs to doing this, but they are typically worth it.
As an example of this, take a look at the [[https://github.com/karlicoss/HPI/tree/5ef277526577daaa115223e79a07a064ffa9bc85/my/github][my.github]] and the corresponding [[https://github.com/karlicoss/ghexport][ghexport]] data exporter which saves github data.
- Pros:
- This allows someone to install and use ~ghexport~ without having to set up HPI at all -- it's a standalone tool, which means there's less barrier to entry
- It being a separate repository means issues relating to exporting data and the [[https://beepb00p.xyz/exports.html#dal][DAL]] (loading the data) can be handled there, instead of in HPI
- This reduces complexity for someone looking at the ~my.github~ files trying to debug issues related to HPI. The functionality for ~ghexport~ can be tested independently of someone new to HPI trying to debug a configuration issue
- It's easier to combine additional data sources, like ~my.github.gdpr~, which includes additional data from the GDPR export
- Cons:
- Leads to some code duplication, as you can no longer use helper functions from ~my.core~ in the new repository
- Additional boilerplate - instructions, installation scripts, testing. It's not required, but typically you want to leverage ~setuptools~ to allow ~pip install git+https...~ type installs, which are used in ~hpi module install~
- It's difficult to convert to a namespace module/directory down the road
Not all HPI Modules are currently at that level of complexity -- some are simple enough that one can understand the file by just reading it top to bottom. Some wouldn't make sense to split off into separate modules for one reason or another.
A related concern is how to structure namespace packages to allow users to easily extend them, and how this conflicts with single file modules (Keep reading below for more information on namespace packages/extension) If a module is converted from a single file module to a namespace with multiple files, it seems this is a breaking change, see [[https://github.com/karlicoss/HPI/issues/89][#89]] for an example of this. The current workaround is to leave it a regular python package with an =__init__.py= for some amount of time and send a deprecation warning, and then eventually remove the =__init__.py= file to convert it into a namespace package. For an example, see the [[https://github.com/karlicoss/HPI/blob/8422c6e420f5e274bd1da91710663be6429c666c/my/reddit/__init__.py][reddit init file]].
It's quite a pain to have to convert a file from a single file module to a namespace module, so if there's *any* possibility that you might convert it to a namespace package, you might as well just start it off as one, to avoid the pain down the road. As an example, say you were creating something to parse ~zsh~ history. Instead of creating ~my/zsh.py~, it would be better to create ~my/zsh/parser.py~. That lets users override the file using editable/namespace packages, and it also means in the future it's much more trivial to extend it to something like:
#+begin_src
my/zsh
├── all.py -- e.g. combined/unique/sorted zsh history
├── aliases.py -- parse zsh alias files
├── common.py -- shared models/merging code
├── compdump.py -- parse zsh compdump files
└── parser.py -- parse individual zsh history files
#+end_src
There's no requirement to follow this entire structure when you start off, the entire module could live in ~my/zsh/parser.py~, including all the merging/parsing/locating code. It just avoids the trouble in the future, and the only downside is having to type a bit more when importing from it.
#+html: <div id="addingmodules"></div>
* Adding new modules
As always, if the changes you wish to make are small, or you just want to add a few modules, you can clone and edit an editable install of HPI. See [[file:SETUP.org][SETUP]] for more information
The "proper way" (unless you want to contribute to the upstream) is to create a separate file hierarchy and add your module to =PYTHONPATH= (or use 'editable namespace packages' as described below, which also modifies your computed ~sys.path~)
# TODO link to 'overlays' documentation?
You can check my own [[https://github.com/karlicoss/hpi-personal-overlay][personal overlay]] as a reference.
For example, if you want to add an =awesomedatasource=, it could be:
: custom_module
: └── my
:     └── awesomedatasource.py
You can use all existing HPI modules in =awesomedatasource.py=, including =my.config= and everything from =my.core=.
=hpi modules= or =hpi doctor= commands should also detect your extra modules.
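For illustration, a minimal =awesomedatasource.py= might look like this (the config attribute and data format are made up; =get_files= is the helper from =my.core= mentioned in [[file:MODULES.org][MODULES]]):
#+begin_src python
from pathlib import Path
from typing import Iterator

from my.core import get_files

from my.config import awesomedatasource as user_config  # your private config

def inputs() -> list:
    # resolve the flexible path[s]/glob attribute into concrete files
    return list(get_files(user_config.export_path))

def items() -> Iterator[str]:
    for path in inputs():
        yield from Path(path).read_text().splitlines()
#+end_src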
- In addition, you can *override* the builtin HPI modules too:
: custom_lastfm_overlay
: └── my
:     └── lastfm.py
Now if you add =custom_lastfm_overlay= [[https://docs.python.org/3/using/cmdline.html#envvar-PYTHONPATH][*in front* of ~PYTHONPATH~]], all the downstream scripts using =my.lastfm= will load it from =custom_lastfm_overlay= instead.
This could be useful to monkey patch some behaviours, or dynamically add some extra data sources -- anything that comes to your mind.
You can check [[https://github.com/karlicoss/hpi-personal-overlay/blob/7fca8b1b6031bf418078da2d8be70fd81d2d8fa0/src/my/calendar/holidays.py#L1-L14][my.calendar.holidays]] in my personal overlay as a reference.
** Namespace Packages
Note: this section covers some of the complexities and benefits with this being a namespace package and/or editable install, so it assumes some familiarity with python/imports
HPI is installed as a namespace package, which allows an additional way to add your own modules. For the details on namespace packages, see [[https://www.python.org/dev/peps/pep-0420/][PEP420]], or the [[https://packaging.python.org/guides/packaging-namespace-packages][packaging docs for a summary]], but for our use case, a sufficient description might be: Namespace packages let you split a package across multiple directories on disk.
Without adding a bulky/boilerplate-y plugin framework to HPI (as that increases the barrier to entry), [[https://packaging.python.org/guides/creating-and-discovering-plugins/#using-namespace-packages][namespace packages offer an alternative]] with few downsides.
Creating a separate file hierarchy still allows you to keep up to date with any changes from this repository by running ~git pull~ on your local clone of HPI periodically (assuming you've installed it as an editable package (~pip install -e .~)), while creating your own modules, and possibly overwriting any files you wish to override/overlay.
In order to do that, like stated above, you could edit the ~PYTHONPATH~ variable, which in turn modifies your computed ~sys.path~, which is how python [[https://docs.python.org/3/library/sys.html?highlight=pythonpath#sys.path][determines the search path for modules]]. This is sort of what [[file:../with_my][with_my]] allows you to do.
In the context of HPI, it being a namespace package means you can have a local clone of this repository, and your own 'HPI' modules in a separate folder, which then get combined into the ~my~ package.
As an example, say you were trying to override the ~my.lastfm~ file, to include some new feature. You could create a new file hierarchy like:
: .
: ├── my
: │   ├── lastfm.py
: │   └── some_new_module.py
: └── setup.py
Where ~lastfm.py~ is your version of ~my.lastfm~, which you've copied from this repository and applied your changes to. The ~setup.py~ would be something like:
#+begin_src python
from setuptools import setup, find_namespace_packages

# should use a different name,
# so it's possible to differentiate between HPI installs
setup(
    name="my-HPI-overlay",
    zip_safe=False,
    packages=find_namespace_packages(".", include=("my*",)),
)
#+end_src
Then, running ~python3 -m pip install -e .~ in that directory would install that as part of the namespace package, and assuming (see below for possible issues) this appears on ~sys.path~ before the upstream repository, your ~lastfm.py~ file overrides the upstream. Adding more files, like ~my.some_new_module~ into that directory immediately updates the global ~my~ package -- allowing you to quickly add new modules without having to re-install.
If you install both directories as editable packages (which has the benefit of any changes you making in either repository immediately updating the globally installed ~my~ package), there are some concerns with which editable install appears on your ~sys.path~ first. If you wanted your modules to override the upstream modules, yours would have to appear on the ~sys.path~ first (this is the same reason that =custom_lastfm_overlay= must be at the front of your ~PYTHONPATH~). For more details and examples on dealing with editable namespace packages in the context of HPI, see the [[https://github.com/purarue/reorder_editable][reorder_editable]] repository.
There is no limit to how many directories you could install into a single namespace package, which could be a possible way for people to install additional HPI modules, without worrying about the module count here becoming too large to manage.
There are some other users [[https://github.com/hpi/hpi][who have begun publishing their own modules]] as namespace packages, which you could potentially install and use, in addition to this repository, if any of those interest you. If you want to create your own you can use the [[https://github.com/purarue/HPI-template][template]] to get started.
Though, enabling this many modules may make ~hpi doctor~ look pretty busy. You can explicitly choose to enable/disable modules with a list of modules/regexes in your [[https://github.com/karlicoss/HPI/blob/f559e7cb899107538e6c6bbcf7576780604697ef/my/core/core_config.py#L24-L55][core config]], see [[https://github.com/purarue/dotfiles/blob/a1a77c581de31bd55a6af3d11b8af588614a207e/.config/my/my/config/__init__.py#L42-L72][here]] for an example.
You may use the other modules or [[https://github.com/karlicoss/hpi-personal-overlay][my overlay]] as reference, but python packaging is already a complicated issue, before adding complexities like namespace packages and editable installs on top of it... If you're having trouble extending HPI in this fashion, you can open an issue here, preferably with a link to your code/repository and/or ~setup.py~ you're trying to use.
* An Extendable module structure
In this context, 'overlay'/'override' means you create your own namespace package/file structure as described above, and since your files are in front of the upstream repository files in the computed ~sys.path~ (either by using namespace modules, the ~PYTHONPATH~ or ~with_my~), your files override the upstream repository
Related issues: [[https://github.com/karlicoss/HPI/issues/102][#102]], [[https://github.com/karlicoss/HPI/issues/89][#89]], [[https://github.com/karlicoss/HPI/issues/154][#154]]
The main goals are:
- low effort: ideally it should be a matter of a few lines of code to override something.
- good interop: e.g. ability to keep with the upstream, use modules coming from separate repositories, etc.
- ideally mypy friendly. This kind of means 'not too dynamic and magical', which is ultimately a good thing even if you don't care about mypy.
~all.py~ using modules/sources behind ~import_source~ is the solution we've arrived at in HPI, because it meets all of these goals:
- it doesn't require an additional plugin system, is just python imports and
namespace packages
- is generally mypy friendly (the only exception is the ~import_source~
decorator, but that typically just yields nothing if the import failed; see the sketch below this list)
- doesn't require you to maintain a fork of this repository, though you can maintain a separate HPI repository (so no patching/merge conflicts)
- allows you to easily add/remove sources to the ~all.py~ module, either by:
- overriding an ~all.py~ in your own repository
- just commenting out the source/adding 2 lines to import and ~yield from~ your new source
- doing nothing! (~import_source~ will catch the error and just warn you
and continue to work without changing any code)
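To make the above concrete, here's a rough sketch of the idea behind ~import_source~ (illustrative only; the actual implementation lives in ~my.core.source~ and differs in details):
#+begin_src python
import warnings
from functools import wraps
from typing import Any, Callable, Iterator

def import_source(*, module_name: str) -> Callable:
    def decorator(factory: Callable[..., Iterator[Any]]) -> Callable[..., Iterator[Any]]:
        @wraps(factory)
        def wrapper(*args, **kwargs) -> Iterator[Any]:
            try:
                yield from factory(*args, **kwargs)
            except ImportError as e:
                # warn instead of crashing, and yield nothing,
                # so the other sources in all.py keep working
                warnings.warn(f"could not import {module_name}: {e}")
        return wrapper
    return decorator
#+end_src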
It could be argued that namespace packages and editable installs are a bit complex for a new user to get the hang of, and this is true. But fortunately ~import_source~ means any user just using HPI only needs to follow the instructions when a warning is printed, or peruse the docs here a bit -- there's no need to clone or create your own override to just use the ~all.py~ file.
There's no requirement to use this for individual modules; it just seems to be the best solution we've arrived at so far
* Logging guidelines
HPI doesn't enforce any specific logging mechanism, you're free to use whatever you prefer in your modules.
However there are some general guidelines for developing modules that can make them more pleasant to use.
- each module should have its unique logger; the easiest way to ensure that is to simply use the module's ~__name__~ attribute as the logger name
In addition, this ensures the logger hierarchy reflects the package hierarchy.
For instance, if you initialize the logger for =my.module= with specific settings, the logger for =my.module.helper= would inherit these settings. See more on that [[https://docs.python.org/3/library/logging.html?highlight=logging#logger-objects][in python docs]].
As a bonus, if you use the module's ~__name__~, this logger will automatically be picked up and used by ~cachew~.
- often modules are processing multiple files, extracting data from each one ([[https://beepb00p.xyz/exports.html#types][incremental/synthetic exports]])
It's nice to log each file name you're processing as =logger.info= so the user of the module gets a sense of progress.
If possible, add the index of the file you're processing and the total count.
#+begin_src python
def process_all_data():
    paths = inputs()
    total = len(paths)
    width = len(str(total))
    for idx, path in enumerate(paths):
        # :>{width} to align the logs vertically
        logger.info(f'processing [{idx:>{width}}/{total:>{width}}] {path}')
        yield from process_path(path)
#+end_src
If there is a lot of logging happening related to a specific path, instead of adding the path to each logging message manually, consider using [[https://docs.python.org/3/library/logging.html?highlight=loggeradapter#logging.LoggerAdapter][LoggerAdapter]] (see the sketch at the end of this section).
- log exceptions, but sparingly
Generally it's a good practice to call ~logging.exception~ from the ~except~ clause, so it's immediately visible where the errors are happening.
However, in HPI, instead of crashing on exceptions we often behave defensively and ~yield~ them instead (see [[https://beepb00p.xyz/mypy-error-handling.html][mypy assisted error handling]]).
In this case, logging every exception may become a bit spammy, so use exception logging sparingly.
Typically it's best to rely on the downstream data consumer to handle the exceptions properly.
- instead of =logging.getLogger=, it's best to use =my.core.make_logger=
#+begin_src python
from my.core import make_logger
logger = make_logger(__name__)
# or to set a custom level
logger = make_logger(__name__, level='warning')
#+end_src
This sets up some nicer defaults over standard =logging= module:
- colored logs (via =colorlog= library)
- =INFO= as the initial logging level (instead of default =ERROR=)
- logging the full exception trace even when logging outside of the exception handler
This is particularly useful for [[https://beepb00p.xyz/mypy-error-handling.html][mypy assisted error handling]].
By default, =logging= only logs the exception message (without the trace) in this case, which makes errors harder to debug.
- control logging level from the shell via ~LOGGING_LEVEL_*~ env variable
This can be useful to suppress logging output if it's too spammy, or showing more output for debugging.
E.g. ~LOGGING_LEVEL_my_instagram_gdpr=DEBUG hpi query my.instagram.gdpr.messages~
- experimental: passing env variable ~LOGGING_COLLAPSE=<loglevel>~ will "collapse" logging with the same level
Instead of printing new logging line each time, it will 'redraw' the last logged line with a new logging message.
This can be convenient if there are too many logs, you just need logging to get a sense of progress.
- experimental: passing env variable ~ENLIGHTEN_ENABLE=yes~ will display TUI progress bars in some cases
See [[https://github.com/Rockhopper-Technologies/enlighten#readme][https://github.com/Rockhopper-Technologies/enlighten#readme]]
This can be convenient for showing the progress of parallel processing of different files from HPI:
#+BEGIN_EXAMPLE
ghexport.dal[111] 29%|████████████████████ | 29/100 [00:03<00:07, 10.03 files/s]
rexport.dal[comments] 17%|████████ | 115/682 [00:03<00:14, 39.15 files/s]
my.instagram.android 0%|▎ | 3/2631 [00:02<34:50, 1.26 files/s]
#+END_EXAMPLE
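To illustrate the ~LoggerAdapter~ suggestion from earlier, a minimal sketch (=PathAdapter= and the messages are made up for illustration):
#+begin_src python
import logging

class PathAdapter(logging.LoggerAdapter):
    # automatically prepend the current path to every message
    def process(self, msg, kwargs):
        return f"[{self.extra['path']}] {msg}", kwargs

logger = logging.getLogger(__name__)

def process_path(path):
    plog = PathAdapter(logger, {'path': path})
    plog.info('parsing...')  # logged as: [<path>] parsing...
    plog.info('done')
#+end_src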

doc/OVERLAYS.org (new file)
@@ -0,0 +1,322 @@
NOTE this kinda overlaps with [[file:MODULE_DESIGN.org][the module design doc]], should be unified in the future.
Relevant discussion about overlays: https://github.com/karlicoss/HPI/issues/102
# This is describing TODO
# TODO goals
# - overrides
# - proper mypy support
# - TODO reusing parent modules?
# You can see them TODO in overlays dir
Consider a toy package/module structure with minimal code, without any actual data parsing, just for demonstration purposes.
- =main= package structure
# TODO do links
- =my/twitter/gdpr.py=
Extracts Twitter data from GDPR archive.
- =my/twitter/all.py=
Merges twitter data from multiple sources (only =gdpr= in this case), so data consumers are agnostic of specific data sources used.
This will be overridden by =overlay=.
- =my/twitter/common.py=
Contains a helper function to merge data, so it can be reused by the overlay's =all.py=.
- =my/reddit.py=
Extracts Reddit data -- this won't be overridden by the overlay, we just keep it for demonstration purposes.
- =overlay= package structure
- =my/twitter/talon.py=
Extracts Twitter data from Talon android app.
- =my/twitter/all.py=
Override for =all.py= from =main= package -- it merges together data from =gdpr= and =talon= modules.
# TODO mention resolution? reorder_editable
* Installing (editable install)
NOTE: this was tested with =python 3.10= and =pip 23.3.2=.
To install, we run:
: pip3 install --user -e overlay/
: pip3 install --user -e main/
# TODO mention non-editable installs (this bit will still work with non-editable install)
As a result, we get:
: pip3 list | grep hpi
: hpi-main 0.0.0 /project/main/src
: hpi-overlay 0.0.0 /project/overlay/src
: cat ~/.local/lib/python3.10/site-packages/easy-install.pth
: /project/overlay/src
: /project/main/src
(the order above is important, so =overlay= takes precedence over =main= TODO link)
Verify the setup:
: $ python3 -c 'import my; print(my.__path__)'
: _NamespacePath(['/project/overlay/src/my', '/project/main/src/my'])
This basically means that modules will be searched in both paths, with overlay taking precedence.
** Installing with =--use-pep517=
See here for discussion https://github.com/purarue/reorder_editable/issues/2, but TLDR it should work similarly.
* Testing runtime behaviour (editable install)
: $ python3 -c 'import my.reddit as R; print(R.upvotes())'
: [main] my.reddit hello
: ['reddit upvote1', 'reddit upvote2']
Just as expected here, =my.reddit= is imported from the =main= package, since it doesn't exist in =overlay=.
Let's check twitter now:
: $ python3 -c 'import my.twitter.all as T; print(T.tweets())'
: [overlay] my.twitter.all hello
: [main] my.twitter.common hello
: [main] my.twitter.gdpr hello
: [overlay] my.twitter.talon hello
: ['gdpr tweet 1', 'gdpr tweet 2', 'talon tweet 1', 'talon tweet 2']
As expected, =my.twitter.all= was imported from the =overlay=.
As you can see it's merged data from =gdpr= (from =main= package) and =talon= (from =overlay= package).
So far so good, let's see how it works with mypy.
* Mypy support (editable install)
To check that mypy works as expected, I injected some statements into the modules that have no impact on runtime,
but should trigger mypy, like this: =trigger_mypy_error: str = 123=.
Let's run it:
: $ mypy --namespace-packages --strict -p my
: overlay/src/my/twitter/talon.py:9: error: Incompatible types in assignment (expression has type "int", variable has type "str")
: [assignment]
: trigger_mypy_error: str = 123
: ^
: Found 1 error in 1 file (checked 4 source files)
Hmm, this did find the statement in the =overlay=, but missed everything from =main= (e.g. =reddit.py= and =gdpr.py= should have also triggered the check).
First, let's check which sources mypy is processing:
: $ mypy --namespace-packages --strict -p my -v 2>&1 | grep BuildSource
: LOG: Found source: BuildSource(path='/project/overlay/src/my', module='my', has_text=False, base_dir=None)
: LOG: Found source: BuildSource(path='/project/overlay/src/my/twitter', module='my.twitter', has_text=False, base_dir=None)
: LOG: Found source: BuildSource(path='/project/overlay/src/my/twitter/all.py', module='my.twitter.all', has_text=False, base_dir=None)
: LOG: Found source: BuildSource(path='/project/overlay/src/my/twitter/talon.py', module='my.twitter.talon', has_text=False, base_dir=None)
So it seems like mypy is not processing anything from the =main= package at all?
At this point I cloned mypy, put a breakpoint, and found out this is the culprit: https://github.com/python/mypy/blob/1dd8e7fe654991b01bd80ef7f1f675d9e3910c3a/mypy/modulefinder.py#L288
This basically returns the first path where it finds =my= package, which happens to be the overlay in this case.
So everything else is ignored?
It even seems to have a test for a similar use case, which is quite sad.
https://github.com/python/mypy/blob/1dd8e7fe654991b01bd80ef7f1f675d9e3910c3a/mypy/test/testmodulefinder.py#L64-L71
For now, I opened an issue in mypy repository https://github.com/python/mypy/issues/16683
But ok, maybe mypy treats =main= as an external package somehow but still type checks it properly?
Let's see what's going on with imports:
: $ mypy --namespace-packages --strict -p my --follow-imports=error
: overlay/src/my/twitter/talon.py:9: error: Incompatible types in assignment (expression has type "int", variable has type "str")
: [assignment]
: trigger_mypy_error: str = 123
: ^
: overlay/src/my/twitter/all.py:3: error: Import of "my.twitter.common" ignored [misc]
: from .common import merge
: ^
: overlay/src/my/twitter/all.py:6: error: Import of "my.twitter.gdpr" ignored [misc]
: from . import gdpr
: ^
: overlay/src/my/twitter/all.py:6: note: (Using --follow-imports=error, module not passed on command line)
: overlay/src/my/twitter/all.py: note: In function "tweets":
: overlay/src/my/twitter/all.py:8: error: Returning Any from function declared to return "List[str]" [no-any-return]
: return merge(gdpr, talon)
: ^
: Found 4 errors in 2 files (checked 4 source files)
Nope -- looks like it's completely unaware of =main=, and what's worse, by default (without tweaking =--follow-imports=), these errors would be suppressed.
What if we check =my.twitter= directly?
: $ mypy --namespace-packages --strict -p my.twitter --follow-imports=error
: overlay/src/my/twitter/talon.py:9: error: Incompatible types in assignment (expression has type "int", variable has type "str")
: [assignment]
: trigger_mypy_error: str = 123
: ^~~
: overlay/src/my/twitter: error: Ancestor package "my" ignored [misc]
: overlay/src/my/twitter: note: (Using --follow-imports=error, submodule passed on command line)
: overlay/src/my/twitter/all.py:3: error: Import of "my.twitter.common" ignored [misc]
: from .common import merge
: ^
: overlay/src/my/twitter/all.py:3: note: (Using --follow-imports=error, module not passed on command line)
: overlay/src/my/twitter/all.py:6: error: Import of "my.twitter.gdpr" ignored [misc]
: from . import gdpr
: ^
: overlay/src/my/twitter/all.py: note: In function "tweets":
: overlay/src/my/twitter/all.py:8: error: Returning Any from function declared to return "list[str]" [no-any-return]
: return merge(gdpr, talon)
: ^~~~~~~~~~~~~~~~~~~~~~~~~
: Found 5 errors in 3 files (checked 3 source files)
Now we're also getting =error: Ancestor package "my" ignored [misc]= .. not ideal.
* What if we don't install at all?
Instead of an editable install, let's try running mypy directly over the source files.
First, let's only check the =main= package:
: $ MYPYPATH=main/src mypy --namespace-packages --strict -p my
: main/src/my/twitter/gdpr.py:9: error: Incompatible types in assignment (expression has type "int", variable has type "str") [assignment]
: trigger_mypy_error: str = 123
: ^~~
: main/src/my/reddit.py:11: error: Incompatible types in assignment (expression has type "int", variable has type "str") [assignment]
: trigger_mypy_error: str = 123
: ^~~
: Found 2 errors in 2 files (checked 6 source files)
As expected, it found both errors.
Now with overlay as well:
: $ MYPYPATH=overlay/src:main/src mypy --namespace-packages --strict -p my
: overlay/src/my/twitter/all.py:6: note: In module imported here:
: main/src/my/twitter/gdpr.py:9: error: Incompatible types in assignment (expression has type "int", variable has type "str") [assignment]
: trigger_mypy_error: str = 123
: ^~~
: overlay/src/my/twitter/talon.py:9: error: Incompatible types in assignment (expression has type "int", variable has type "str")
: [assignment]
: trigger_mypy_error: str = 123
: ^~~
: Found 2 errors in 2 files (checked 4 source files)
Interestingly enough, this is slightly better than the editable install (it detected the error in =gdpr.py= as well).
But still no =reddit.py= error.
TODO possibly worth submitting to mypy issue tracker as well...
Overall it seems that properly type checking the HPI setup as a whole is kinda problematic, especially if the modules actually override/extend base modules.
* Modifying (monkey patching) original module in the overlay
Let's say we want to modify/monkey patch the =my.twitter.gdpr= module from =main=, for example, convert "gdpr" to uppercase, i.e. =tweet.replace('gdpr', 'GDPR')=.
# TODO see overlay2/
I think our options are:
- symlink to the 'parent' packages, e.g. =main= in this case
Alternatively, somehow install =main= under a different name/alias (managed by pip).
This is discussed here: https://github.com/karlicoss/HPI/issues/102
The main upside is that it's relatively simple and (sort of) works with mypy.
There are a few big downsides:
- creates a parallel package hierarchy (to the one maintained by pip), symlinks will need to be carefully managed manually
This may not be such a huge deal if you don't have too many overlays.
However this results in problems if you're trying to switch between two different HPI checkouts (e.g. stable and development). If you have symlinks into "stable" from the overlay then stable modules will sometimes be picked up when you're expecting "development" package.
- symlinks pointing outside of the source tree might cause pip install to go into infinite loop
- it modifies the package name
This may potentially result in some confusing behaviours.
One thing I noticed for example is that cachew caches might get duplicated.
- it might not work in all cases or might result in recursive imports
- do not shadow the original module
Basically, instead of shadowing via the namespace package mechanism and creating an identically named module,
create some sort of hook that would patch the original =my.twitter.gdpr= module from =main=.
The downside is that it's a bit unclear where to do that, we need some sort of entry point?
- it could be some global dynamic hook defined in the overlay, and then executed from =my.core=
However, it's a bit intrusive, and unclear how to handle errors. E.g. what if we're monkey patching a module that we weren't intending to use, don't have dependencies installed and it's crashing?
Perhaps core could support something like =_hook= in each of HPI's modules?
Note that it can't be =my.twitter.all=, since we might want to override =.all= itself.
The downside is that this is probably not going to work well with =tmp_config= and such -- we'd need to somehow execute the hook again on reloading the module?
- ideally we'd have something that integrates with =importlib= and is executed automatically when the module is imported?
TODO explore these:
- https://stackoverflow.com/questions/43571737/how-to-implement-an-import-hook-that-can-modify-the-source-code-on-the-fly-using
- https://github.com/brettlangdon/importhook
This one is pretty intrusive, and has some issues, e.g. https://github.com/brettlangdon/importhook/issues/4
Let's try it:
: $ PYTHONPATH=overlay3/src:main/src python3 -c 'import my.twitter._hook; import my.twitter.all as M; print(M.tweets())'
: [main] my.twitter.all hello
: [main] my.twitter.common hello
: [main] my.twitter.gdpr hello
: EXECUTING IMPORT HOOK!
: ['GDPR tweet 1', 'GDPR tweet 2']
Ok it worked, and seems pretty neat.
However sadly it doesn't work with =tmp_config= (TODO add a proper demo?)
Not sure if it's more of an issue with =tmp_config= implementation (which is very hacky), or =importhook= itself?
In addition, still the question is where to put the hook itself, but in that case even a global one could be fine.
- define hook in =my/twitter/__init__.py=
Basically, use =extend_path= to make it behave like a namespace package, but in addition, patch the original =my.twitter.gdpr=?
: $ cat overlay2/src/my/twitter/__init__.py
: print(f'[overlay2] {__name__} hello')
:
: from pkgutil import extend_path
: __path__ = extend_path(__path__, __name__)
:
: def hack_gdpr_module() -> None:
: from . import gdpr
: tweets_orig = gdpr.tweets
: def tweets_patched():
: return [t.replace('gdpr', 'GDPR') for t in tweets_orig()]
: gdpr.tweets = tweets_patched
:
: hack_gdpr_module()
This actually seems to work??
: PYTHONPATH=overlay2/src:main/src python3 -c 'import my.twitter.all as M; print(M.tweets())'
: [overlay2] my.twitter hello
: [main] my.twitter.gdpr hello
: [main] my.twitter.all hello
: [main] my.twitter.common hello
: ['GDPR tweet 1', 'GDPR tweet 2']
However, this doesn't stack, i.e. if the 'parent' overlay had its own =__init__.py=, it wouldn't get called.
- shadow the original module and temporarily modify =__path__= before importing the same module from the parent overlay
This approach is implemented in =my.core.experimental.import_original_module=
TODO demonstrate it properly, but I think that also works in a 'chain' of overlays
Seems like that option is the most promising so far, albeit very hacky.
Note that none of these options work well with mypy (since it's all dynamic hackery), even if you disregard the issues described in the previous sections.
# TODO .pkg files? somewhat interesting... https://github.com/python/cpython/blob/3.12/Lib/pkgutil.py#L395-L410

304
doc/QUERY.md Normal file
View file

@ -0,0 +1,304 @@
`hpi query` is a command line tool for querying the output of any `hpi` function.
```
Usage: hpi query [OPTIONS] FUNCTION_NAME...
This allows you to query the results from one or more functions in HPI
By default this runs with '-o json', converting the results to JSON and
printing them to STDOUT
You can specify '-o pprint' to just print the objects using their repr, or
'-o repl' to drop into a ipython shell with access to the results
While filtering using --order-key datetime, the --after, --before and
--within flags parse the input to their datetime and timedelta equivalents.
datetimes can be epoch time, the string 'now', or a date formatted in the
ISO format. timedelta (durations) are parsed from a similar format to the
GNU 'sleep' command, e.g. 1w2d8h5m20s -> 1 week, 2 days, 8 hours, 5 minutes,
20 seconds
As an example, to query reddit comments I've made in the last month
hpi query --order-type datetime --before now --within 4w my.reddit.all.comments
or...
hpi query --recent 4w my.reddit.all.comments
Can also query within a range. To filter comments between 2016 and 2018:
hpi query --order-type datetime --after '2016-01-01' --before '2019-01-01' my.reddit.all.comments
Options:
-o, --output [json|pprint|repl|gpx]
what to do with the result [default: json]
-s, --stream stream objects from the data source instead
of printing a list at the end
-k, --order-key TEXT order by an object attribute or dict key on
the individual objects returned by the HPI
function
-t, --order-type [datetime|date|int|float]
order by searching for some type on the
iterable
-a, --after TEXT while ordering, filter items for the key or
type larger than or equal to this
-b, --before TEXT while ordering, filter items for the key or
type smaller than this
-w, --within TEXT a range 'after' or 'before' to filter items
by. see above for further explanation
-r, --recent TEXT a shorthand for '--order-type datetime
--reverse --before now --within'. e.g.
--recent 5d
--reverse / --no-reverse reverse the results returned from the
functions
-l, --limit INTEGER limit the number of items returned from the
(functions)
--drop-unsorted if the order of an item can't be determined
while ordering, drop those items from the
results
--wrap-unsorted if the order of an item can't be determined
while ordering, wrap them into an
'Unsortable' object
--warn-exceptions if any errors are returned, print them as
errors on STDERR
--raise-exceptions if any errors are returned (as objects, not
raised) from the functions, raise them
--drop-exceptions ignore any errors returned as objects from
the functions
--help Show this message and exit.
```
This works with any function which returns an iterable, for example `my.coding.commits`, which searches for `git commit`s on your computer:
```bash
hpi query my.coding.commits
```
When run with a module, this does some analysis of the functions in that module and tries to find ones that look like data sources. If it can't figure out which, it prompts you like:
```
Which function should be used from 'my.coding.commits'?
1. commits
2. repos
```
You select the one you want by pressing `1` or `2` on your keyboard. Otherwise, you can provide a fully qualified path, like:
```
hpi query my.coding.commits.repos
```
The corresponding `repos` function this queries is defined in [`my/coding/commits.py`](../my/coding/commits.py)
### Ordering/Filtering/Streaming
By default, this just returns the items in the order they were returned by the function. To sort the output, you can specify a `--order-key` or `--order-type`. `--order-type datetime` will try to automatically figure out which attribute to use; if it chooses the wrong one (`Commit`s have both a `committed_dt` and an `authored_dt`), you can point it at the right attribute with `--order-key`. For example, to scan my computer and find the most recent commit I made:
```
hpi query my.coding.commits.commits --order-key committed_dt --limit 1 --reverse --output pprint --stream
Commit(committed_dt=datetime.datetime(2023, 4, 14, 23, 9, 1, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))),
authored_dt=datetime.datetime(2023, 4, 14, 23, 4, 1, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))),
message='sources.smscalls: propagate errors if there are breaking '
'schema changes',
repo='/home/username/Repos/promnesia-fork',
sha='22a434fca9a28df9b0915ccf16368df129d2c9ce',
ref='refs/heads/smscalls-handle-result')
```
To instead limit in some range, you can use `--before` and `--within` to filter by a range. For example, to get all the commits I committed in the last day:
```
hpi query my.coding.commits.commits --order-type datetime --before now --within 1d
```
That prints a list of `Commit`s as JSON objects. You could also use `--output pprint` to pretty-print the objects, or `--output repl` to drop into a REPL.
To process the JSON, you can pipe it to [`jq`](https://github.com/stedolan/jq). I often use `jq length` to get the count of some output:
```
hpi query my.coding.commits.commits --order-type datetime --before now --within 1d | jq length
6
```
Because grabbing data `--before now` is such a common use case, the `--recent` flag is a shorthand for `--order-type datetime --reverse --before now --within`. The same as above, to get the commits from the last day:
```
hpi query my.coding.commits.commits --recent 1d | jq length
6
```
To select a range of commits, you can use `--after` and `--before`, passing ISO or epoch timestamps. Those can be full `datetimes` (`2021-01-01T00:05:30`) or just dates (`2021-01-01`). For example, to get all the commits I made on January 1st, 2021:
```
hpi query my.coding.commits.commits --order-type datetime --after 2021-01-01 --before 2021-01-02 | jq length
1
```
If you have [`dateparser`](https://github.com/scrapinghub/dateparser#how-to-use) installed, this supports dozens more natural language formats:
```
hpi query my.coding.commits.commits --order-type datetime --after 'last week' --before 'day before yesterday' | jq length
28
```
If you're having issues ordering because there are exceptions in your results, or not all data is sortable (it may have `None` for some attributes), you can use `--drop-unsorted` to drop those items from the results, or `--drop-exceptions` to remove the exceptions
You can also stream the results, which is useful for functions that take a while to process or have a lot of data. For example, if you wanted to pick a sha hash from a particular repo, you could use `jq` to `select` and pick that attribute from the JSON:
```
hpi query my.coding.commits.commits --recent 30d --stream | jq 'select(.repo | contains("HPI"))' | jq '.sha' -r
4afa899c8b365b3c10e468f6279c02e316d3b650
40de162fab741df594b4d9651348ee46ee021e9b
e1cb229913482074dc5523e57ef0acf6e9ec2bb2
87c13defd131e39292b93dcea661d3191222dace
02c738594f2cae36ca4fab43cf9533fe6aa89396
0b3a2a6ef3a9e4992771aaea0252fb28217b814a
84817ce72d208038b66f634d4ceb6e3a4c7ec5e9
47992b8e046d27fc5141839179f06f925c159510
425615614bd508e28ccceb56f43c692240e429ab
eed8f949460d768fb1f1c4801e9abab58a5f9021
d26ad7d9ce6a4718f96346b994c3c1cd0d74380c
aec517e53c6ac022f2b4cc91261daab5651cebf0
44b75a88fdfc7af132f61905232877031ce32fcb
b0ff6f29dd2846e97f8aa85a2ca73736b03254a8
```
`jq`'s `select` function acts on a stream of JSON objects, not a list, so it filters the output of `hpi query` as the objects are generated (the goal here is to conserve memory, as items which aren't needed are filtered out). The alternative would be to print the entire JSON list at the end, like:
`hpi query my.coding.commits.commits --recent 30d | jq '.[] | select(.repo | contains("Repos/HPI"))' | jq '.sha' -r`, using `jq '.[]'` to convert the JSON list into a stream of JSON objects.
## Usage on non-HPI code
The command can accept any qualified function name, so this could for example be used to check the output of [`promnesia`](https://github.com/karlicoss/promnesia) sources:
```
hpi query promnesia.sources.smscalls | jq length
371
```
This can be used on any function that produces an `Iterator`/`Generator` like output, as long as it can be called with no arguments.
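For example, a sketch of a custom module (hypothetical names) that `hpi query` could consume, as long as it's importable (e.g. on your `PYTHONPATH`):

```python
# mymodule.py -- any no-argument function returning an iterator works
from typing import Iterator


def items() -> Iterator[dict]:
    yield {"value": 1}
    yield {"value": 2}
```

Then: `hpi query mymodule.items | jq length`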
## GPX
The `hpi query` command can also be used with the `--output gpx` flag to generate gpx files from a list of locations, like the ones defined in the `my.location` package. This could be used to extract some date range and create a `gpx` file which can then be visualized by a GUI application.
This prints the contents of the `gpx` file to STDOUT, and prints warnings for any objects it could not convert to locations to STDERR, so pipe STDOUT to an output file, like `>out.gpx`
```
hpi query my.location.all --after '2021-07-01T00:00:00' --before '2021-07-05T00:00:00' --order-type datetime --output gpx >out.gpx
```
If you want to ignore any errors, you can use `--drop-exceptions`.
To preview, you can use something like [`qgis`](https://qgis.org/en/site/) or, for something more lightweight, [`gpxsee`](https://github.com/tumic0/GPXSee):
`gpxsee out.gpx`:
<img src="https://user-images.githubusercontent.com/7804791/232249184-7e203ee6-a3ec-4053-800c-751d2c28e690.png" width=500 alt="chicago trip" />
(Sidenote: this is [`@purarue`](https://github.com/purarue/)'s locations, on a trip to Chicago)
## Python reference
The `hpi query` command is a CLI wrapper around the code in [`query.py`](../my/core/query.py) and [`query_range.py`](../my/core/query_range.py). The `select` function is the core of this, and `select_range` adds dates, timedeltas, start/end ranges, and other CLI-specific handling on top.
`my.core.query.select`:
```
A function to query, order, sort and filter items from one or more sources
This supports iterables and lists of mixed types (including handling errors),
by allowing you to provide custom predicates (functions) which can sort
by a function, an attribute, dict key, or by the attributes values.
Since this supports mixed types, there's always a possibility
of KeyErrors or AttributeErrors while trying to find some value to order by,
so this provides multiple mechanisms to deal with that
'where' lets you filter items before ordering, to remove possible errors
or filter the iterator by some condition
There are multiple ways to instruct select on how to order items. The most
flexible is to provide an 'order_by' function, which takes an item in the
iterator, does any custom checks you may want and then returns the value to sort by
'order_key' is best used on items which have a similar structure, or have
the same attribute name for every item in the iterator. If you have an
iterator of objects whose datetime is accessed by the 'timestamp' attribute,
supplying order_key='timestamp' would sort by that (dictionary or attribute) key
'order_value' is the most confusing, but often the most useful. Instead of
testing against the keys of an item, this allows you to write a predicate
(function) to test against its values (dictionary, NamedTuple, dataclass, object).
If you had an iterator of mixed types and wanted to sort by the datetime,
but the attribute to access the datetime is different on each type, you can
provide `order_value=lambda v: isinstance(v, datetime)`, and this will
try to find that value for each type in the iterator, to sort it by
the value which is received when the predicate is true
'order_value' is often used in the 'hpi query' interface, because of its brevity.
Just given the input function, this can typically sort it by timestamp with
no human intervention. It can sort of be thought as an educated guess,
but it can always be improved by providing a more complete guess function
Note that 'order_value' is also the most computationally expensive, as it has
to copy the iterator in memory (using itertools.tee) to determine how to order it
in memory
The 'drop_exceptions', 'raise_exceptions', 'warn_exceptions' let you ignore or raise
when the src contains exceptions. The 'warn_func' lets you provide a custom function
to call when an exception is encountered instead of using the 'warnings' module
src: an iterable of mixed types, or a function to be called,
as the input to this function
where: a predicate which filters the results before sorting
order_by: a function which when given an item in the src,
returns the value to sort by. Similar to the 'key' value
typically passed directly to 'sorted'
order_key: a string which represents a dict key or attribute name
to use as the key to sort by
order_value: predicate which determines which attribute on an ADT-like item to sort by,
when given its value. lambda o: isinstance(o, datetime) is commonly passed to sort
by datetime, without knowing the attributes or interface for the items in the src
default: while ordering, if the order for an object cannot be determined,
use this as the default value
reverse: reverse the order of the resulting iterable
limit: limit the results to this many items
drop_unsorted: before ordering, drop any items from the iterable for which a
order could not be determined. False by default
wrap_unsorted: before ordering, wrap any items into an 'Unsortable' object. Place
them at the front of the list. True by default
drop_exceptions: ignore any exceptions from the src
raise_exceptions: raise exceptions when received from the input src
```
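As a rough sketch of calling `select` directly from Python (using only the parameters documented above, and assuming the `my.coding.commits` module from the earlier examples is set up):

```python
from my.core.query import select
from my.coding.commits import commits

# the 10 most recent commits, newest first -- mirrors the CLI example above
for commit in select(commits, order_key='committed_dt', reverse=True, limit=10):
    print(commit.committed_dt, commit.repo, commit.sha)
```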
`my.core.query_range.select_range`:
```
A specialized select function which offers generating functions
to filter/query ranges from an iterable
order_key and order_value are used in the same way they are in select
If you specify order_by_value_type, it tries to search for an attribute
on each object/type which has that type, ordering the iterable by that value
unparsed_range is a tuple of length 3, specifying 'after', 'before', 'duration',
i.e. some start point to allow the computed value we're ordering by, some
end point and a duration (can use the RangeTuple NamedTuple to construct one)
(this is typically parsed/created in my.core.__main__, from CLI flags)
If you specify a range, drop_unsorted is forced to be True
```
Those can be imported and accept any sort of iterator, `hpi query` just defaults to the output of functions here. As an example, see [`listens`](https://github.com/purarue/HPI-personal/blob/master/scripts/listens) which just passes a generator (iterator) as the first argument to `query_range`
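A similar sketch for `select_range`, constructing the range positionally in the ('after', 'before', 'duration') order from the docstring above (again assuming `my.coding.commits` is set up; check `query_range.py` for the exact signatures):

```python
from my.coding.commits import commits
from my.core.query_range import RangeTuple, select_range

# commits made during 2021, using the same unparsed strings the CLI accepts
rng = RangeTuple('2021-01-01', '2022-01-01', None)
for commit in select_range(commits(), order_key='committed_dt', unparsed_range=rng):
    print(commit.committed_dt)
```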

View file

@ -2,39 +2,83 @@
Please don't be shy and raise issues if something in the instructions is unclear.
You'd be really helping me, I want to make the setup as straightforward as possible!
# update with org-make-toc
* TOC
:PROPERTIES:
:TOC: :include all
:END:
:CONTENTS:
- [[#toc][TOC]]
- [[#few-notes][Few notes]]
- [[#install-main-hpi-package][Install main HPI package]]
- [[#option-1-install-from-pip][option 1: install from PIP]]
- [[#option-2-localeditable-install][option 2: local/editable install]]
- [[#option-3-use-without-installing][option 3: use without installing]]
- [[#appendix-optional-packages][appendix: optional packages]]
- [[#setting-up-modules][Setting up modules]]
- [[#private-configuration-myconfig][private configuration (my.config)]]
- [[#module-dependencies][module dependencies]]
- [[#troubleshooting][Troubleshooting]]
- [[#common-issues][common issues]]
- [[#usage-examples][Usage examples]]
- [[#end-to-end-roam-research-setup][End-to-end Roam Research setup]]
- [[#polar][Polar]]
- [[#google-takeout][Google Takeout]]
- [[#kobo-reader][Kobo reader]]
- [[#orger][Orger]]
- [[#orger--polar][Orger + Polar]]
- [[#demopy][demo.py]]
- [[#data-flow][Data flow]]
- [[#polar-bookshelf][Polar Bookshelf]]
- [[#google-takeout][Google Takeout]]
- [[#reddit][Reddit]]
- [[#twitter][Twitter]]
- [[#connecting-to-other-apps][Connecting to other apps]]
- [[#addingmodifying-modules][Adding/modifying modules]]
:END:
* Few notes
I understand that people who'd like to use this may not be super familiar with Python, pip or generally unix, so here are some useful notes:
- only ~python >= 3.7~ is supported
- I'm using ~pip3~ command, but on your system you might only have ~pip~.
If your ~pip --version~ says python 3, feel free to use ~pip~.
- If you have issues getting ~pip~ or ~pip3~ to work, it may be worth invoking the module instead using a fully qualified path, like ~python3 -m pip~ (e.g. ~python3 -m pip install --user ..~)
- similarly, I'm using =python3= in the documentation, but if your =python --version= says python3, it's okay to use =python=
- when you are using ~pip install~, [[https://stackoverflow.com/a/42989020/706389][always pass]] =--user=, and *never install third party packages with sudo* (unless you know what you are doing)
- throughout the guide I'm assuming the user config directory is =~/.config=, but it's *different on Mac/Windows*.
See [[https://github.com/ActiveState/appdirs/blob/3fe6a83776843a46f20c2e5587afcffe05e03b39/appdirs.py#L187-L190][this]] if you're not sure what your user config dir is.
* Install main HPI package
This is a *required step*
You can choose one of the following options:
** option 1: install from [[https://pypi.org/project/HPI][PIP]]
This is the *easiest way*:
: pip3 install --user HPI
** option 2: local/editable install
This is convenient if you're planning to add new modules or change the existing ones.
1. Clone the repository: =git clone git@github.com:karlicoss/HPI.git /path/to/hpi=
2. Go into the project directory: =cd /path/to/hpi=
3. Run ~pip3 install --user -e .~
This will install the package in 'editable mode'.
It means that any changes to =/path/to/hpi= will be immediately reflected without the need to reinstall anything.
It's *extremely* convenient for developing and debugging.
** option 3: use without installing
This is less convenient, but gives you more control.
1. Clone the repository: =git clone git@github.com:karlicoss/HPI.git /path/to/hpi=
@ -53,94 +97,185 @@ This is less convenient, but gives you more control.
After that, you can wrap your command in =with_my= to give it access to ~my.~ modules, e.g. see [[#usage-examples][examples]].
The benefit of this way is that you get a bit more control, explicitly allowing your scripts to use your data.
** appendix: optional packages
You can also install some optional packages:
: pip3 install 'HPI[optional]'
They aren't necessary, but will improve your experience. At the moment these are:
- [[https://github.com/ijl/orjson][orjson]]: a library for serializing data to JSON, used in ~my.core.serialize~ and the ~hpi query~ interface
- [[https://github.com/karlicoss/cachew][cachew]]: automatic caching library, which can greatly speed up data access
- [[https://github.com/python/mypy][mypy]]: used for checking configs and troubleshooting
- [[https://github.com/borntyping/python-colorlog][colorlog]]: colored formatter for ~logging~ module
- [[https://github.com/Rockhopper-Technologies/enlighten]]: console progress bar library
* Setting up modules
This is an *optional step* as a few modules work without extra setup.
But it depends on the specific module.
See [[file:MODULES.org][MODULES]] to read documentation on specific modules that interest you.
You might also find it interesting to read [[file:CONFIGURING.org][CONFIGURING]], where I
elaborate on some technical rationales behind the current configuration system.
** private configuration (=my.config=)
# TODO write about dynamic configuration
# todo add a command to edit config?? e.g. HPI config edit
If you're not planning to use private configuration (some modules don't need it) you can skip straight to the next step. Still, I'd recommend reading it anyway.
The configuration usually contains paths to the data on your disk, and some modules have extra settings.
The config is simply a *python package* (named =my.config=), expected to be in =~/.config/my=.
If you'd like to change the location of the =my.config= directory, you can set the =MY_CONFIG= environment variable. e.g. in your .bashrc add: ~export MY_CONFIG=$HOME/.my/~
Since it's a Python package, generally it's very *flexible* and there are many ways to set it up.
- *The simplest way*
After installing HPI, run =hpi config create=.
This will create an empty config file for you (usually, in =~/.config/my=), which you can edit. Example configuration:
#+begin_src python
import pytz  # yes, you can use any Python stuff in the config

class emfit:
    export_path = '/data/exports/emfit'
    tz = pytz.timezone('Europe/London')
    excluded_sids = []
    cache_path = '/tmp/emfit.cache'

class instapaper:
    export_path = '/data/exports/instapaper'

class roamresearch:
    export_path = '/data/exports/roamresearch'
    username = 'karlicoss'
#+end_src
To find out which attributes you need to specify:
- check in [[file:MODULES.org][MODULES]]
- check in [[file:../my/config.py][the default config stubs]]
- if there is nothing there, the easiest is perhaps to skim through the module's code and search for =config.= uses.
For example, if you search for =config.= in [[file:../my/emfit/__init__.py][emfit module]], you'll see that it's using =export_path=, =tz=, =excluded_sids= and =cache_path=.
- or you can just try running them and fill in the attributes Python complains about!
or run =hpi doctor my.modulename=
# TODO link to post about exports?
** module dependencies
Dependencies are different for specific modules you're planning to use, so it's hard to tell in advance what you'll need.
First thing you should try is just using the module; if it works -- great! If it doesn't (i.e. you get something like =ImportError=):
- try using =hpi module install <modulename>= (where =<modulename>= is something like =my.hypothesis=, etc.)
This command uses the [[https://github.com/karlicoss/HPI/search?l=Python&q=REQUIRES][REQUIRES]] declaration to install the dependencies.
- otherwise manually install missing packages via ~pip3 install --user~
Also please feel free to report if the command above didn't install some dependencies!
* Troubleshooting
# todo replace with_my with it??
HPI comes with a command line tool that can help you detect potential issues. Run:
: hpi doctor
: # alternatively, for more output:
: hpi doctor --verbose
If you only have a few modules set up, lots of them will error for you, which is expected, so check the ones you expect to work.
If you're having issues with ~cachew~ or want to show logs to troubleshoot what may be happening, you can pass the debug flag (e.g., ~hpi --debug doctor my.module_name~) or set the ~LOGGING_LEVEL_HPI~ environment variable (e.g., ~LOGGING_LEVEL_HPI=debug hpi query my.module_name~) to print all logs, including the ~cachew~ dependencies. ~LOGGING_LEVEL_HPI~ could also be used to silence ~info~ logs, like ~LOGGING_LEVEL_HPI=warning hpi ...~
If you want to enable logs for a particular module, you can use the
~LOGGING_LEVEL_~ prefix and then the module name with underscores, like
~LOGGING_LEVEL_my_hypothesis=debug hpi query my.hypothesis~
If you want ~HPI~ to autocomplete the module names for you, this comes with shell completion, see [[../misc/completion/][misc/completion]]
If you have any ideas on how to improve it, please let me know!
Here's a screenshot of how it looks when everything is mostly good: [[https://user-images.githubusercontent.com/291333/82806066-f7dfe400-9e7c-11ea-8763-b3bee8ada308.png][link]].
If you experience issues, feel free to report, but please attach your:
- OS version
- python version: =python3 --version=
- HPI version: =pip3 show HPI=
- if you see some exception, attach a full log (just make sure there is no private information in it)
- if you think it can help, attach screenshots
** common issues
- run =hpi config check=; it helps to spot certain errors
It's also really recommended to install =mypy= first, it helps to spot various trivial errors
- if =hpi= shows you something like 'command not found', try using =python3 -m my.core= instead
This likely means that your =$HOME/.local/bin= directory isn't in your =$PATH=
* Usage examples
** End-to-end Roam Research setup
In [[https://beepb00p.xyz/myinfra-roam.html#export][this]] post you can trace all steps:
- learn how to export your raw data
- integrate it with HPI package
- benefit from HPI integration
- use interactively in ipython
- use with [[https://github.com/karlicoss/orger][Orger]]
- use with [[https://github.com/karlicoss/promnesia][Promnesia]]
If you want to set up a new data source, it could be a good learning reference.
** Polar
Polar doesn't require any setup as it accesses the highlights on your filesystem (usually in =~/.polar=).
You can check if it works with:
: python3 -c 'import my.polar as polar; print(polar.get_entries())'
** Google Takeout
If you have zip Google Takeout archives, you can use HPI to access them:
- prepare the config =~/.config/my/my/config.py=
#+begin_src python
class google:
# you can pass the directory, a glob, or a single zip file
takeout_path = '/backups/takeouts/*.zip'
#+end_src
- use it:
#+begin_src
$ python3 -c 'import my.media.youtube as yt; print(yt.get_watched()[-1])'
Watched(url='https://www.youtube.com/watch?v=p0t0J_ERzHM', title='Monster magnet meets monster magnet...', when=datetime.datetime(2020, 1, 22, 20, 34, tzinfo=<UTC>))
#+end_src
** Kobo reader
The Kobo module allows you to access the books you've read along with the highlights and notes.
It uses exports provided by the [[https://github.com/karlicoss/kobuddy][kobuddy]] package.
- prepare the config
  1. Install =kobuddy= from PIP
  2. Add kobo config to =~/.config/my/my/config.py=
     #+begin_src python
     class kobo:
         export_dir = '/backups/to/kobo/'
     #+end_src
After that you should be able to use it:
#+begin_src bash
python3 -c 'import my.books.kobo as kobo; print(kobo.get_highlights())'
#+end_src
** Orger
@ -152,9 +287,192 @@ Some examples (assuming you've [[https://github.com/karlicoss/orger#installing][
*** Orger + [[https://github.com/burtonator/polar-bookshelf][Polar]]
This will mirror Polar highlights as org-mode:
: orger/modules/polar.py --to polar.org
** =demo.py=
read/run [[../demo.py][demo.py]] for a full demonstration of setting up Hypothesis (uses annotations data from a public Github repository)
* Data flow
# todo eh, could publish this as a blog page? dunno
Here, I'll demonstrate how data flows into and from HPI on several examples, starting from the simplest to more complicated.
If you want to see how it looks as a whole, check out [[https://beepb00p.xyz/myinfra.html#mypkg][my infrastructure map]]!
** Polar Bookshelf
Polar keeps the data:
- *locally*, on your disk
- in =~/.polar=,
- as a bunch of *JSON files*
It's excellent from all perspectives except one -- you can only meaningfully use it through the Polar app.
However, you might want to integrate your data elsewhere and use it in ways that Polar developers never even anticipated!
If you check the data layout ([[https://github.com/TheCedarPrince/KnowledgeRepository][example]]), you can see it's messy: scattered across multiple directories, contains raw HTML, obscure entities, etc.
It's understandable from the app developer's perspective, but it makes things frustrating when you want to work with this data.
# todo hmm what if I could share deserialization with Polar app?
Here comes the HPI [[file:../my/polar.py][polar module]]!
: |💾 ~/.polar (raw JSON data) |
: ⇓⇓⇓
: HPI (my.polar)
: ⇓⇓⇓
: < python interface >
So the data is read from the =|💾 filesystem |=, processed/normalized with HPI, which results in a nice programmatic =< interface >= for Polar data.
Note that it doesn't require any extra configuration -- it "just" works because the data is kept locally in the *known location*.
** Google Takeout
# TODO twitter archive might be better here?
Google Takeout exports are, unfortunately, manual (or semi-manual if you do some [[https://beepb00p.xyz/my-data.html#takeout][voodoo]] with mounting Google Drive).
Anyway, say you're doing it once in six months, so you end up with several archives on your disk:
: /backups/takeout/takeout-20151201.zip
: ....
: /backups/takeout/takeout-20190901.zip
: /backups/takeout/takeout-20200301.zip
Inside the archives.... there is a [[https://www.specytech.com/blog/wp-content/uploads/2019/06/google-takeout-folder.png][bunch]] of random files from all your google services.
Lately, many of them are JSONs, but for example, in 2015 most of it was in HTMLs! It's a nightmare to work with, even when you're an experienced programmer.
# Even within a single data source (e.g. =My Activity/Search=) you have a mix of HTML and JSON files.
# todo eh, I need to actually add JSON processing first
Of course, HPI helps you here by encapsulating all this parsing logic and exposing Python interfaces instead.
: < 🌐 Google |
: ⇓⇓⇓
: { manual download }
: ⇓⇓⇓
: |💾 /backups/takeout/*.zip |
: ⇓⇓⇓
: HPI (my.google.takeout)
: ⇓⇓⇓
: < python interface >
The only thing you need to do is to tell it where to find the files on your disk, via [[file:MODULES.org::#mygoogletakeoutpaths][the config]], because different people use different paths for backups.
# TODO how to emphasize config?
** Reddit
Reddit has a proper API, so in theory HPI could talk directly to Reddit and retrieve the latest data. But that's not what it's doing!
- first, there are excellent programmatic APIs for Reddit out there already, for example, [[https://github.com/praw-dev/praw][praw]]
- more importantly, this is the [[https://beepb00p.xyz/exports.html#design][design decision]] of HPI
It doesn't deal with the complexities of API interactions at all.
Instead, it relies on other tools to put *intermediate, raw data* on your disk, and then transforms this data into something nice.
As an example, for [[file:../my/reddit.py][Reddit]], HPI relies on data fetched by the [[https://github.com/karlicoss/rexport][rexport]] library. So the pipeline looks like:
: < 🌐 Reddit |
: ⇓⇓⇓
: { rexport/export.py (automatic, e.g. cron) }
: ⇓⇓⇓
: |💾 /backups/reddit/*.json |
: ⇓⇓⇓
: HPI (my.reddit.rexport)
: ⇓⇓⇓
: < python interface >
So, in your [[file:MODULES.org::#myreddit][reddit config]], similarly to Takeout, you need =export_path=, so HPI knows how to find your Reddit data on the disk.
But there is an extra caveat: rexport is already coming with nice [[https://github.com/karlicoss/rexport/blob/master/dal.py][data bindings]] to parse its outputs.
Several other HPI modules are following a similar pattern: hypothesis, instapaper, pinboard, kobo, etc.
Since the [[https://github.com/karlicoss/rexport#api-limitations][reddit API has limited results]], you can use [[https://github.com/purarue/pushshift_comment_export][my.reddit.pushshift]] to access older reddit comments, both of which then get merged into =my.reddit.all.comments=
** Twitter
Twitter is interesting, because it's an example of an HPI module that *arbitrates* between several data sources from the same service.
The reasons to use multiple data sources in the case of Twitter:
- there is the official Twitter Archive, but it's manual, takes several days to complete and is hard to automate.
- there is [[https://github.com/twintproject/twint][twint]], which can get real-time Twitter data via scraping
But Twitter has a limitation: you can't get data past the most recent 3200 tweets through the API or scraping.
So the idea is to export both data sources on your disk:
: < 🌐 Twitter |
: ⇓⇓ ⇓⇓
: { manual archive download } { twint (automatic, cron) }
: ⇓⇓⇓ ⇓⇓⇓
: |💾 /backups/twitter-archives/*.zip | |💾 /backups/twint/db.sqlite |
: .............
# TODO note that the left and right parts of the diagram ('before filesystem' and 'after filesystem') are completely independent!
# if something breaks, you can still read your old data from the filesystem!
What we do next is:
1. Process raw data from twitter archives (manual export, but has all the data)
2. Process raw data from twint database (automatic export, but only recent data)
3. Merge them together, overlaying twint data on top of twitter archive data
: .............
: |💾 /backups/twitter-archives/*.zip | |💾 /backups/twint/db.sqlite |
: ⇓⇓⇓ ⇓⇓⇓
: HPI (my.twitter.archive) HPI (my.twitter.twint)
: ⇓ ⇓ ⇓ ⇓
: ⇓ HPI (my.twitter.all) ⇓
: ⇓ ⇓⇓ ⇓
: < python interface> < python interface> < python interface>
For merging the data, we're using a tiny auxiliary module, =my.twitter.all= (it's just 20 lines of code, [[file:../my/twitter/all.py][check it out]]).
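Schematically, such a merging module boils down to something like this (a simplified sketch, not the actual file):
#+begin_src python
from itertools import chain

def tweets():
    # sources are imported inside the function,
    # so merely importing the merging module stays cheap
    from . import archive, twint
    return list(chain(archive.tweets(), twint.tweets()))
#+end_src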
Since you have two different sources of raw data, you need to specify two bits of config:
# todo link to modules thing?
: class twint:
: export_path = '/backups/twint/db.sqlite'
: class twitter_archive:
: export_path = '/backups/twitter-archives/*.zip'
Note that you can also just use =my.twitter.archive= or =my.twitter.twint= directly, or set either of the paths to an empty string: =''=
# #addingmodifying-modules
# Now, say you prefer to use a different library for your Twitter data instead of twint (for whatever reason), and you want to use it
# TODO docs on overlays?
** Connecting to other apps
As a user you might not be so interested in the Python interface per se, but a nice thing about having one is that it's easy to
connect the data with other apps and libraries!
: /---- 💻promnesia --- | browser extension >
: | python interface > ----+---- 💻orger --- |💾 org-mode mirror |
: +-----💻memacs --- |💾 org-mode lifelog |
: +-----💻???? --- | REST api >
: +-----💻???? --- | Datasette >
: \-----💻???? --- | Memex >
See more in [[file:../README.org::#how-do-you-use-it]["How do you use it?"]] section.
Also check out [[https://beepb00p.xyz/myinfra.html#hpi][my personal infrastructure map]] to see where I'm using HPI.
* Adding/modifying modules
# TODO link to 'overlays' documentation?
# TODO don't be afraid to TODO make sure to install in editable mode
- The easiest is just to clone the HPI repository and run an editable PIP install (=pip3 install --user -e .=), or use the [[#use-without-installing][with_my]] wrapper.
After that you can just edit the code directly, your changes will be reflected immediately, and you will be able to quickly iterate/fix bugs/add new methods.
This is great if you just want to add a few of your own personal modules, or make minimal changes to a few files. If you do much more than that, you may run into possible merge conflicts if/when you update (~git pull~) HPI
# TODO eh. doesn't even have to be in 'my' namespace?? need to check it
- The "proper way" (unless you want to contribute to the upstream) is to create a separate file hierarchy and add your module to =PYTHONPATH=.
# hmmm seems to be no obvious way to link to a header in a separate file,
# if you want this in both emacs and how github renders org mode
# https://github.com/karlicoss/HPI/pull/160#issuecomment-817318076
See [[file:MODULE_DESIGN.org#addingmodules][MODULE_DESIGN/adding modules]] for more information

View file

@ -0,0 +1,4 @@
#!/bin/bash
set -eux
pip3 install --user "$@" -e main/
pip3 install --user "$@" -e overlay/

View file

@ -0,0 +1,17 @@
from setuptools import setup, find_namespace_packages # type: ignore
def main() -> None:
pkgs = find_namespace_packages('src')
pkg = min(pkgs)
setup(
name='hpi-main',
zip_safe=False,
packages=pkgs,
package_dir={'': 'src'},
package_data={pkg: ['py.typed']},
)
if __name__ == '__main__':
main()

View file

@ -0,0 +1,11 @@
print(f'[main] {__name__} hello')
def upvotes() -> list[str]:
return [
'reddit upvote1',
'reddit upvote2',
]
trigger_mypy_error: str = 123

View file

@ -0,0 +1,7 @@
print(f'[main] {__name__} hello')
from .common import merge
def tweets() -> list[str]:
from . import gdpr
return merge(gdpr)

View file

@ -0,0 +1,11 @@
print(f'[main] {__name__} hello')
from typing import Protocol
class Source(Protocol):
def tweets(self) -> list[str]:
...
def merge(*sources: Source) -> list[str]:
from itertools import chain
return list(chain.from_iterable(src.tweets() for src in sources))

View file

@ -0,0 +1,9 @@
print(f'[main] {__name__} hello')
def tweets() -> list[str]:
return [
'gdpr tweet 1',
'gdpr tweet 2',
]
trigger_mypy_error: str = 123

View file

@ -0,0 +1,17 @@
from setuptools import setup, find_namespace_packages # type: ignore
def main() -> None:
pkgs = find_namespace_packages('src')
pkg = min(pkgs)
setup(
name='hpi-overlay',
zip_safe=False,
packages=pkgs,
package_dir={'': 'src'},
package_data={pkg: ['py.typed']},
)
if __name__ == '__main__':
main()

View file

@ -0,0 +1,8 @@
print(f'[overlay] {__name__} hello')
from .common import merge
def tweets() -> list[str]:
from . import gdpr
from . import talon
return merge(gdpr, talon)

View file

@ -0,0 +1,9 @@
print(f'[overlay] {__name__} hello')
def tweets() -> list[str]:
return [
'talon tweet 1',
'talon tweet 2',
]
trigger_mypy_error: str = 123

View file

@ -0,0 +1,17 @@
from setuptools import setup, find_namespace_packages # type: ignore
def main() -> None:
pkgs = find_namespace_packages('src')
pkg = min(pkgs)
setup(
name='hpi-overlay2',
zip_safe=False,
packages=pkgs,
package_dir={'': 'src'},
package_data={pkg: ['py.typed']},
)
if __name__ == '__main__':
main()

View file

@ -0,0 +1,13 @@
print(f'[overlay2] {__name__} hello')
from pkgutil import extend_path
__path__ = extend_path(__path__, __name__)
def hack_gdpr_module() -> None:
from . import gdpr
tweets_orig = gdpr.tweets
def tweets_patched():
return [t.replace('gdpr', 'GDPR') for t in tweets_orig()]
gdpr.tweets = tweets_patched
hack_gdpr_module()

View file

@ -0,0 +1,17 @@
from setuptools import setup, find_namespace_packages # type: ignore
def main() -> None:
pkgs = find_namespace_packages('src')
pkg = min(pkgs)
setup(
name='hpi-overlay3',
zip_safe=False,
packages=pkgs,
package_dir={'': 'src'},
package_data={pkg: ['py.typed']},
)
if __name__ == '__main__':
main()

View file

@ -0,0 +1,9 @@
import importhook
@importhook.on_import('my.twitter.gdpr')
def on_import(gdpr):
print("EXECUTING IMPORT HOOK!")
tweets_orig = gdpr.tweets
def tweets_patched():
return [t.replace('gdpr', 'GDPR') for t in tweets_orig()]
gdpr.tweets = tweets_patched

80
lint
View file

@ -1,80 +0,0 @@
#!/usr/bin/env python3
from pathlib import Path
from pprint import pprint
from itertools import chain
from subprocess import check_call, run, PIPE
import sys
from typing import List, Optional, Iterable
def log(*args):
print(*args, file=sys.stderr)
DIR = Path(__file__).absolute().parent
# hmm. I guess I need to check all subpackages separately
# otherwise pylint doesn't work and mypy doesn't discover everything
# TODO could reuse in readme??
# returns None if not a package
def package_name(p: Path) -> str:
def mname(p: Path):
nosuf = p.with_suffix('')
return str(nosuf).replace('/', '.')
has_init = (p.parent / '__init__.py').exists()
if has_init:
return mname(p.parent)
else:
return mname(p)
def packages() -> Iterable[str]:
yield from sorted(set(
package_name(p.relative_to(DIR)) for p in (DIR / 'my').rglob('*.py')
))
def pylint():
# TODO ugh. pylint still doesn't like checking my.config or my.books
# only top level .py files seem ok??
pass
def mypy(package: str):
return run([
'mypy',
'--color-output', # TODO eh? doesn't work..
'--namespace-packages',
'-p', package,
], stdout=PIPE, stderr=PIPE)
def mypy_all() -> Iterable[Exception]:
from concurrent.futures import ThreadPoolExecutor
pkgs = list(packages())
log(f"Checking {pkgs}")
with ThreadPoolExecutor() as pool:
for p, res in zip(pkgs, pool.map(mypy, pkgs)):
ret = res.returncode
if ret > 0:
log(f'FAILED: {p}')
else:
log(f'OK: {p}')
print(res.stdout.decode('utf8'))
print(res.stderr.decode('utf8'), file=sys.stderr)
try:
res.check_returncode()
except Exception as e:
yield e
def main():
errors = list(mypy_all())
if len(errors) > 0:
sys.exit(1)
if __name__ == '__main__':
main()

View file

@ -1,35 +0,0 @@
Various thoughts on organizing
* Importing external models
- First alternative:
@lru_cache()
def hypexport():
... import_file
# doesn't really work either..
# hypexport = import_file(Path(paths.hypexport.repo) / 'model.py')
+ TODO check pytest friendliness if some paths are missing? Wonder if still easier to control by manually excluding...
- not mypy/pylint friendly at all?
- Second alternative:
symlinks in mycfg and direct import?
+ mypy/pylint friendly
? keeping a symlink to model.py is not much worse than hardcoding path. so it's ok I guess
* Thoughts on organizing imports
- First way:
import mycfg.hypexport_model as hypexport
works, but mycfg is scattered across the repository?
Second way:
from . import mycfg?
doesn't seem to work with subpackages?
right, perhaps symlinking is a good idea after all?...
Third way:
import mycfg.repos.hypexport.model as hypexport
works, but MYPYPATH doesn't seem to be happy...
ok, --namespace-packages solves it..

37
misc/.flake8-karlicoss Normal file
View file

@ -0,0 +1,37 @@
[flake8]
ignore =
## these mess up vertical alignment
E126 # continuation line over-indented
E202 # whitespace before )
E203 # whitespace before ':' (e.g. in dict)
E221 # multiple spaces before operator
E241 # multiple spaces after ,
E251 # unexpected spaces after =
E261 # 2 spaces before comment. I actually think it's fine so TODO enable back later (TODO or not? still alignment)
E271 # multiple spaces after keyword
E272 # multiple spaces before keyword
##
E266 # 'too many leading # in the comment' -- this is just unnecessary pickiness, sometimes it's nice to format a comment
E302 # 2 blank lines
E501 # 'line too long' -- kinda annoying and the default 79 is shit anyway
E702 E704 # multiple statements on one line -- messes with : ... type declarations + sometimes asserts
E731 # suggests always using def instead of lambda
E402 # FIXME module level import -- we want it later
E252 # TODO later -- whitespace around equals?
# F541: f-string is missing placeholders -- perhaps too picky?
# F841 is pretty useful (unused variables). maybe worth making it an error on CI
# for imports: we might want to check these
# F401 good: unused imports
# E401: import order
# F811: redefinition of unused import
# todo from my.core import __NOT_HPI_MODULE__ this needs to be excluded from 'unused'
#
# as a reference:
# https://github.com/purarue/cookiecutter-template/blob/master/%7B%7Bcookiecutter.module_name%7D%7D/setup.cfg
# and this https://github.com/karlicoss/HPI/pull/151
# find ./my | entr flake8 --ignore=E402,E501,E741,W503,E266,E302,E305,E203,E261,E252,E251,E221,W291,E225,E303,E702,E202,F841,E731,E306,E127,E722,E231 my | grep -v __NOT_HPI_MODULE__

105
misc/check-twitter.sh Executable file
View file

@ -0,0 +1,105 @@
#!/bin/bash
# just a hacky script to check twitter module behaviour w.r.t. merging and normalising data
# this checks against orger output for @karlicoss data
set -eu
FILE="$1"
function check() {
x="$1"
if [[ $(rg --count "$x" "$FILE") != "1" ]]; then
echo "FAILED! $x"
fi
}
# only in old twitter archive data + test mentions
check '2010-03-24 Wed 10:02.*@GDRussia подлагивает'
# check that &lt/&gt entities in old twitter archive data are unescaped
check '2011-05-12 Thu 17:51.*set ><'
# this would probs be from twint or something?
check '2013-06-01 Sat 18:48.*<inputfile'
# https://twitter.com/karlicoss/status/363703394201894912
# the quoted acc was suspended and the tweet is only present in archives?
check '2013-08-03 Sat 16:50.*удивительно, как в одном человеке'
# similar
# https://twitter.com/karlicoss/status/712186968382291968
check '2016-03-22 Tue 07:59.*Очень хорошо'
# RTs are missing from twint
# https://twitter.com/karlicoss/status/925968541458759681
check '2017-11-02 Thu 06:11.*RT @dabeaz: A short esoteric Python'
# twint stopped updating at this point
# https://twitter.com/karlicoss/status/1321488603499954177
check '2020-10-28 Wed 16:26.*@jborichevskiy I feel like for me'
# https://twitter.com/karlicoss/status/808769414984331267
# archive doesn't expand links in 'text' by default, check we're doing that in HPI
# NOTE: hmm twint adds an extra whitespace here before the link?
check '2016-12-13 Tue 20:23.*TIL:.*pypi.python.org/pypi/coloredlogs'
# https://twitter.com/karlicoss/status/472151454044917761
# archive isn't expanding images by default
check '2014-05-29 Thu 23:04.*Выколол сингулярность.*pic.twitter.com/M6XRN1n7KW'
# https://twitter.com/karlicoss/status/565648186816335873
# for some reason missing from twint??
check '2015-02-11 Wed 23:06.*separation confirmed'
# mentions were missing from twint at some point, check they are still present..
# https://twitter.com/karlicoss/status/1228225797283966976
check '2020-02-14 Fri 07:53.*thomas536.*looks like a very cool blog'
# just a random timestamp check. RT then reply shortly after -- good check.
# https://twitter.com/karlicoss/status/341512959694082049
check '2013-06-03 Mon 11:13.*RT @osenin'
# https://twitter.com/karlicoss/status/341513515749736448
check '2013-06-03 Mon 11:15.*@osenin'
# def was tweeted at 00:00 MSK, so a good timezone check
# id 550396141914058752
check '2014-12-31 Wed 21:00.*2015 заебал'
# for some reason is gone, and wasn't in twidump/twint
# https://twitter.com/karlicoss/status/1393312193945513985
check '2021-05-14 Fri 21:08.*RT @SNunoPerez: Me explaining Rage.*'
# make sure there is a single occurrence (hence, correct tzs)
check 'A short esoteric Python'
# https://twitter.com/karlicoss/status/1499174823272099842
check 'It would be a really good time for countries'
# https://twitter.com/karlicoss/status/1530303537476947968
check 'so there is clearly a pattern'
# https://twitter.com/karlicoss/status/1488942357303238673
# check URL expansion for Talon
check '2022-02-02 Wed 18:28.*You are in luck!.*https://deepmind.com/blog/article/Competitive-programming-with-AlphaCode'
# https://twitter.com/karlicoss/status/349168455964033024
# check link which is only in twidump
check '2013-06-24 Mon 14:13.*RT @gorod095: Нашел недавно в букинист'
# some older statuses, useful to test that all input data is properly detected
check '2010-04-01 Thu 11:34'
check '2010-06-28 Mon 23:42'
# https://twitter.com/karlicoss/status/22916704915
# this one is weird, just disappeared for no reason between 2021-12-22 and 2022-03-15
# and the account isn't suspended etc. maybe it was temporary private or something?
check '2010-09-03 Fri 20:11.*Джобс'
# TODO check likes as well

84
misc/check_legacy_init_py.py Executable file
View file

@ -0,0 +1,84 @@
#!/usr/bin/env python3
# NOTE: prerequisites for this test:
# fbmessengerexport installed
# config configured (can set it to '' though)
from pathlib import Path
from subprocess import Popen, run, PIPE
from tempfile import TemporaryDirectory
import logzero # type: ignore[import]
logger = logzero.logger
MSG = 'my.fbmessenger is DEPRECATED'
def expect(*cmd: str, should_warn: bool=True) -> None:
res = run(cmd, stderr=PIPE)
errb = res.stderr; assert errb is not None
err = errb.decode('utf8')
if should_warn:
assert MSG in err, res
else:
assert MSG not in err, res
assert res.returncode == 0, res
def _check(*cmd: str, should_warn: bool, run_as_cmd: bool=True) -> None:
expecter = lambda *cmd: expect(*cmd, should_warn=should_warn)
if cmd[0] == '-c':
[_, code] = cmd
if run_as_cmd:
expecter('python3', '-c', code)
# check as a script
with TemporaryDirectory() as tdir:
script = Path(tdir) / 'script.py'
script.write_text(code)
expecter('python3', str(script))
else:
expecter('python3', *cmd)
what = 'warns' if should_warn else ' ' # meh
logger.info(f"PASSED: {what}: {repr(cmd)}")
def check_warn(*cmd: str, **kwargs) -> None:
_check(*cmd, should_warn=True, **kwargs)
def check_ok(*cmd: str, **kwargs) -> None:
_check(*cmd, should_warn=False, **kwargs)
# NOTE these three are actually sort of OK, they are allowed when it's a proper namespace package with all.py etc.
# but more likely it means legacy behaviour or just misusing the package?
# worst case it's just a warning I guess
check_warn('-c', 'from my import fbmessenger')
check_warn('-c', 'import my.fbmessenger')
check_warn('-c', 'from my.fbmessenger import *')
# note: dump_chat_history should really be deprecated, but it's a quick way to check we actually fell back to fbmessenger/export.py
# NOTE: this is the most common legacy usecase
check_warn('-c', 'from my.fbmessenger import messages, dump_chat_history')
check_warn('-m', 'my.core', 'query' , 'my.fbmessenger.messages', '-o', 'pprint', '--limit=10')
check_warn('-m', 'my.core', 'doctor', 'my.fbmessenger')
check_warn('-m', 'my.core', 'module', 'requires', 'my.fbmessenger')
# todo kinda annoying it doesn't work when executed as -c (but does as script!)
# presumably because doesn't have proper line number information?
# either way, it's a bit of a corner case, the script behaviour is more important
check_ok ('-c', 'from my.fbmessenger import export', run_as_cmd=False)
check_ok ('-c', 'import my.fbmessenger.export')
check_ok ('-c', 'from my.fbmessenger.export import *')
check_ok ('-c', 'from my.fbmessenger.export import messages, dump_chat_history')
check_ok ('-m', 'my.core', 'query' , 'my.fbmessenger.export.messages', '-o', 'pprint', '--limit=10')
check_ok ('-m', 'my.core', 'doctor', 'my.fbmessenger.export')
check_ok ('-m', 'my.core', 'module', 'requires', 'my.fbmessenger.export')
# NOTE:
# to check that overlays work, run something like
# PYTHONPATH=misc/overlay_for_init_py_test/ hpi query my.fbmessenger.all.messages -s -o pprint --limit=10
# you should see 1, 2, 3 from mixin.py
# TODO would be nice to add an automated test for this
# TODO with reddit, currently these don't work properly at all
# only when imported from scripts etc?

37
misc/completion/README.md Normal file
View file

@ -0,0 +1,37 @@
To enable completion for the `hpi` command:
If you don't want to use the files here, you can generate the completion code on the fly when your shell starts:
```bash
eval "$(_HPI_COMPLETE=bash_source hpi)" # in ~/.bashrc
eval "$(_HPI_COMPLETE=zsh_source hpi)" # in ~/.zshrc
eval "$(_HPI_COMPLETE=fish_source hpi)" # in ~/.config/fish/config.fish
```
That is slightly slower since it's generating the completion code on the fly -- see [click docs](https://click.palletsprojects.com/en/8.0.x/shell-completion/#enabling-completion) for more info
To use the generated completion files in this repository, you need to source the file in `./bash`, `./zsh`, or `./fish` depending on your shell.
If you don't have HPI cloned locally, after installing `HPI` you can generate the file yourself using one of the commands above. For example, for `bash`: `_HPI_COMPLETE=bash_source hpi > ~/.config/hpi_bash_completion`, and then source it like `source ~/.config/hpi_bash_completion`
### bash
Put `source /path/to/hpi/repo/misc/completion/bash/_hpi` in your `~/.bashrc`
### zsh
You can either source the file:
`source /path/to/hpi/repo/misc/completion/zsh/_hpi`
..or add the directory to your `fpath` to load it lazily:
`fpath=("/path/to/hpi/repo/misc/completion/zsh/" "${fpath[@]}")` (Note: the directory, not the script `_hpi`)
If your zsh configuration doesn't automatically run `compinit`, after modifying your `fpath` you should:
`autoload -Uz compinit && compinit`
### fish
`cp ./fish/hpi.fish ~/.config/fish/completions/`, then restart your shell

29
misc/completion/bash/_hpi Normal file
View file

@ -0,0 +1,29 @@
_hpi_completion() {
local IFS=$'\n'
local response
response=$(env COMP_WORDS="${COMP_WORDS[*]}" COMP_CWORD=$COMP_CWORD _HPI_COMPLETE=bash_complete $1)
for completion in $response; do
IFS=',' read type value <<< "$completion"
if [[ $type == 'dir' ]]; then
COMPREPLY=()
compopt -o dirnames
elif [[ $type == 'file' ]]; then
COMPREPLY=()
compopt -o default
elif [[ $type == 'plain' ]]; then
COMPREPLY+=($value)
fi
done
return 0
}
_hpi_completion_setup() {
complete -o nosort -F _hpi_completion hpi
}
_hpi_completion_setup;

View file

@ -0,0 +1,18 @@
function _hpi_completion;
set -l response (env _HPI_COMPLETE=fish_complete COMP_WORDS=(commandline -cp) COMP_CWORD=(commandline -t) hpi);
for completion in $response;
set -l metadata (string split "," $completion);
if test $metadata[1] = "dir";
__fish_complete_directories $metadata[2];
else if test $metadata[1] = "file";
__fish_complete_path $metadata[2];
else if test $metadata[1] = "plain";
echo $metadata[2];
end;
end;
end;
complete --no-files --command hpi --arguments "(_hpi_completion)";

12
misc/completion/generate Executable file
View file

@ -0,0 +1,12 @@
#!/usr/bin/env bash
# assumes HPI is already installed
# generates the completion files
cd "$(realpath "$(dirname "${BASH_SOURCE[0]}")")"
mkdir -p ./bash ./zsh ./fish
_HPI_COMPLETE=fish_source hpi >./fish/hpi.fish
# underscores to allow these directories to be lazily loaded
_HPI_COMPLETE=zsh_source hpi >./zsh/_hpi
_HPI_COMPLETE=bash_source hpi >./bash/_hpi

41
misc/completion/zsh/_hpi Normal file
View file

@ -0,0 +1,41 @@
#compdef hpi
_hpi_completion() {
local -a completions
local -a completions_with_descriptions
local -a response
(( ! $+commands[hpi] )) && return 1
response=("${(@f)$(env COMP_WORDS="${words[*]}" COMP_CWORD=$((CURRENT-1)) _HPI_COMPLETE=zsh_complete hpi)}")
for type key descr in ${response}; do
if [[ "$type" == "plain" ]]; then
if [[ "$descr" == "_" ]]; then
completions+=("$key")
else
completions_with_descriptions+=("$key":"$descr")
fi
elif [[ "$type" == "dir" ]]; then
_path_files -/
elif [[ "$type" == "file" ]]; then
_path_files -f
fi
done
if [ -n "$completions_with_descriptions" ]; then
_describe -V unsorted completions_with_descriptions -U
fi
if [ -n "$completions" ]; then
compadd -U -V unsorted -a completions
fi
}
if [[ $zsh_eval_context[-1] == loadautofunc ]]; then
# autoload from fpath, call function directly
_hpi_completion "$@"
else
# eval/source/. command, register function for later
compdef _hpi_completion hpi
fi

View file

@ -0,0 +1,7 @@
from my.fbmessenger import export
from . import mixin
def messages():
yield from mixin.messages()
yield from export.messages()

View file

@ -0,0 +1,2 @@
def messages():
yield from ['1', '2', '3']

63
misc/repl.py Executable file
View file

@ -0,0 +1,63 @@
#!/usr/bin/env python3
# M-x run-python (raise window so it doesn't hide)
# ?? python-shell-send-defun
# C-c C-r python-shell-send-region
# shit, it isn't autoscrolling??
# maybe add hook
# (setq comint-move-point-for-output t) ;; https://github.com/jorgenschaefer/elpy/issues/1641#issuecomment-528355368
#
from itertools import islice, groupby
from more_itertools import ilen, bucket
from importlib import reload
import sys
# todo function to reload hpi?
todel = [m for m in sys.modules if m.startswith('my.')]
for m in todel: del sys.modules[m]
import my
# todo add to doc?
from my.core import get_files
import my.bluemaestro as M
from my.config import bluemaestro as BC
# BC.export_path = get_files(BC.export_path)[:40]
# print(list(M.measurements())[:10])
M.fill_influxdb()
ffwf
#
from my.config import rescuetime as RC
# todo ugh. doesn't work??
# from my.core.cachew import disable_cachew
# disable_cachew()
# RC.export_path = get_files(RC.export_path)[-1:]
import my.rescuetime as M
# print(len(list(M.entries())))
M.fill_influxdb()
print(M.dataframe())
e = M.entries()
e = list(islice(e, 0, 10))
key = lambda x: 'ERROR' if isinstance(x, Exception) else x.activity
# TODO move to errors module? how to preserve type signature?
# b = bucket(e, key=key)
# for k in b:
# g = b[k] # meh? should maybe sort
# print(k, ilen(g))
from collections import Counter
print(Counter(key(x) for x in e))

View file

@ -1,10 +0,0 @@
# shared Rss stuff
from typing import NamedTuple
class Subscription(NamedTuple):
# TODO date?
title: str
url: str
id: str
subscribed: bool=True

116
my/arbtt.py Normal file
View file

@ -0,0 +1,116 @@
'''
[[https://github.com/nomeata/arbtt#arbtt-the-automatic-rule-based-time-tracker][Arbtt]] time tracking
'''
from __future__ import annotations
REQUIRES = ['ijson', 'cffi']
# NOTE likely also needs libyajl2 from apt or elsewhere?
from collections.abc import Iterable, Sequence
from dataclasses import dataclass
from pathlib import Path
def inputs() -> Sequence[Path]:
try:
from my.config import arbtt as user_config
except ImportError:
from my.core.warnings import low
low("Couldn't find 'arbtt' config section, falling back to the default capture.log (usually in HOME dir). Add 'arbtt' section with logfiles = '' to suppress this warning.")
return []
else:
from .core import get_files
return get_files(user_config.logfiles)
from my.core import Json, PathIsh, datetime_aware
from my.core.compat import fromisoformat
@dataclass
class Entry:
'''
For the format reference, see
https://github.com/nomeata/arbtt/blob/e120ad20b9b8e753fbeb02041720b7b5b271ab20/src/DumpFormat.hs#L39-L46
'''
json: Json
# inactive time -- in ms
@property
def dt(self) -> datetime_aware:
# contains utc already
# TODO after python>=3.11, could just use fromisoformat
ds = self.json['date']
elen = 27
lds = len(ds)
if lds < elen:
# ugh. sometimes contains fewer than 6 decimal digits
ds = ds[:-1] + '0' * (elen - lds) + 'Z'
elif lds > elen:
# and sometimes more...
ds = ds[:elen - 1] + 'Z'
return fromisoformat(ds)
@property
def active(self) -> str | None:
# NOTE: WIP, might change this in the future...
ait = (w for w in self.json['windows'] if w['active'])
a = next(ait, None)
if a is None:
return None
a2 = next(ait, None)
assert a2 is None, a2 # hopefully only one can be active in a time?
p = a['program']
t = a['title']
# todo perhaps best to keep it structured, e.g. for influx
return f'{p}: {t}'
# todo multiple threads? not sure if would help much... (+ need to find offset somehow?)
def entries() -> Iterable[Entry]:
inps = list(inputs())
base: list[PathIsh] = ['arbtt-dump', '--format=json']
cmds: list[list[PathIsh]]
if len(inps) == 0:
cmds = [base] # rely on default
else:
# otherwise, 'merge' them
cmds = [[*base, '--logfile', f] for f in inps]
from subprocess import PIPE, Popen
import ijson.backends.yajl2_cffi as ijson # type: ignore
for cmd in cmds:
with Popen(cmd, stdout=PIPE) as p:
out = p.stdout; assert out is not None
for json in ijson.items(out, 'item'):
yield Entry(json=json)
def fill_influxdb() -> None:
from .core.freezer import Freezer
from .core.influxdb import magic_fill
freezer = Freezer(Entry)
fit = (freezer.freeze(e) for e in entries())
# TODO crap, influxdb doesn't like None https://github.com/influxdata/influxdb/issues/7722
# wonder if can check it statically/warn?
fit = (f for f in fit if f.active is not None)
# todo could tag with computer name or something...
# todo should probably also tag with 'program'?
magic_fill(fit, name=f'{entries.__module__}:{entries.__name__}')
from .core import Stats, stat
def stats() -> Stats:
return stat(entries)
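For reference, a minimal usage sketch (assumes arbtt-dump is on PATH and the config above points at your capture logs):

from my.arbtt import entries

for e in entries():
    print(e.dt, e.active)  # active is 'program: title', or None if nothing was active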

261
my/bluemaestro.py Normal file
View file

@ -0,0 +1,261 @@
"""
[[https://bluemaestro.com/products/product-details/bluetooth-environmental-monitor-and-logger][Bluemaestro]] temperature/humidity/pressure monitor
"""
from __future__ import annotations
# todo most of it belongs to DAL... but considering so few people use it I didn't bother for now
import re
import sqlite3
from abc import abstractmethod
from collections.abc import Iterable, Sequence
from dataclasses import dataclass
from datetime import datetime, timedelta
from pathlib import Path
from typing import Protocol
import pytz
from my.core import (
Paths,
Res,
Stats,
get_files,
make_logger,
stat,
unwrap,
)
from my.core.cachew import mcachew
from my.core.pandas import DataFrameT, as_dataframe
from my.core.sqlite import sqlite_connect_immutable
class config(Protocol):
@property
@abstractmethod
def export_path(self) -> Paths:
raise NotImplementedError
@property
def tz(self) -> pytz.BaseTzInfo:
# fixme: later, rely on the timezone provider
# NOTE: the timezone should be set with respect to the export date!!!
return pytz.timezone('Europe/London')
# TODO when I change tz, check the diff
def make_config() -> config:
from my.config import bluemaestro as user_config
class combined_config(user_config, config): ...
return combined_config()
logger = make_logger(__name__)
def inputs() -> Sequence[Path]:
cfg = make_config()
return get_files(cfg.export_path)
Celsius = float
Percent = float
mBar = float
@dataclass
class Measurement:
dt: datetime # todo aware/naive
temp: Celsius
humidity: Percent
pressure: mBar
dewpoint: Celsius
def is_bad_table(name: str) -> bool:
# todo hmm would be nice to have a hook that can patch any module up to
delegate = getattr(config, 'is_bad_table', None)
return False if delegate is None else delegate(name)
@mcachew(depends_on=inputs)
def measurements() -> Iterable[Res[Measurement]]:
cfg = make_config()
tz = cfg.tz
# todo ideally this would be via arguments... but needs to be lazy
paths = inputs()
total = len(paths)
width = len(str(total))
last: datetime | None = None
# tables are immutable, so can save on processing..
processed_tables: set[str] = set()
for idx, path in enumerate(paths):
logger.info(f'processing [{idx:>{width}}/{total:>{width}}] {path}')
tot = 0
new = 0
# todo assert increasing timestamp?
with sqlite_connect_immutable(path) as db:
db_dt: datetime | None = None
try:
datas = db.execute(
f'SELECT "{path.name}" as name, Time, Temperature, Humidity, Pressure, Dewpoint FROM data ORDER BY log_index'
)
oldfmt = True
[(db_dts,)] = db.execute('SELECT last_download FROM info')
if db_dts == 'N/A':
# ??? happens for 20180923-20180928
continue
if db_dts.endswith(':'):
db_dts += '00' # wtf.. happens on some day
db_dt = tz.localize(datetime.strptime(db_dts, '%Y-%m-%d %H:%M:%S'))
except sqlite3.OperationalError:
# Right, this looks really bad.
# The device doesn't have internal time & what it does is:
# 1. every X seconds, record a datapoint, store it in the internal memory
# 2. on sync, take the phone's datetime ('now') and then ASSIGN the timestamps to the collected data
# as now, now - X, now - 2X, etc
#
# that basically means that for example, hourly timestamps are completely useless? because their error is about 1h
# yep, confirmed on some historic exports. seriously, what the fuck???
#
# The device _does_ have an internal clock, but it's basically set to 0 every time you update settings
# So, e.g. if, say, at 17:15 you set the interval to 3600, the 'real' timestamps would be
# 17:15, 18:15, 19:15, etc
# But depending on when you export, you might get
# 17:35, 18:35, 19:35; or 17:55, 18:55, 19:55, etc
# basically all you're guaranteed is that the 'correct' interval is within the frequency
# it doesn't seem to keep the reference time in the database
#
# UPD: fucking hell, so you can set the reference date in the settings (calcReferenceUnix field in meta db)
# but it's not set by default.
log_tables = [c[0] for c in db.execute('SELECT name FROM sqlite_sequence WHERE name LIKE "%_log"')]
log_tables = [t for t in log_tables if t not in processed_tables]
processed_tables |= set(log_tables)
# todo use later?
frequencies = [list(db.execute(f'SELECT interval from {t.replace("_log", "_meta")}'))[0][0] for t in log_tables] # noqa: RUF015
# todo could just filter out the older datapoints?? dunno.
# eh. a bit horrible, but seems the easiest way to do it?
# note: for some reason everything in the new table multiplied by 10
query = ' UNION '.join(
f'SELECT "{t}" AS name, unix, tempReadings / 10.0, humiReadings / 10.0, pressReadings / 10.0, dewpReadings / 10.0 FROM {t}'
for t in log_tables
)
if len(log_tables) > 0: # ugh. otherwise end up with syntax error..
query = f'SELECT * FROM ({query}) ORDER BY name, unix'
datas = db.execute(query)
oldfmt = False
db_dt = None
for (name, tsc, temp, hum, pres, dewp) in datas:
if is_bad_table(name):
continue
# note: bluemaestro keeps local datetime
if oldfmt:
tss = tsc.replace('Juli', 'Jul').replace('Aug.', 'Aug')
dt = datetime.strptime(tss, '%Y-%b-%d %H:%M')
dt = tz.localize(dt)
assert db_dt is not None
else:
# todo cache?
m = re.search(r'_(\d+)_', name)
assert m is not None
export_ts = int(m.group(1))
db_dt = datetime.fromtimestamp(export_ts / 1000, tz=tz)
dt = datetime.fromtimestamp(tsc / 1000, tz=tz)
## sanity checks (todo make defensive/configurable?)
# not sure how that happens.. but basically they'd better be excluded
lower = timedelta(days=6000 / 24) # ugh some time ago I only did it once in an hour.. in theory can detect from meta?
upper = timedelta(days=10) # kinda arbitrary
if not (db_dt - lower < dt < db_dt + upper):
# todo could be more defensive??
yield RuntimeError('timestamp too far out', path, name, db_dt, dt)
continue
# err.. sometimes my values are just interleaved with these for no apparent reason???
if (temp, hum, pres, dewp) == (-144.1, 100.0, 1152.5, -144.1):
yield RuntimeError('the weird sensor bug')
continue
assert -60 <= temp <= 60, (path, dt, temp)
##
tot += 1
if last is not None and last >= dt:
continue
# todo for performance, pass 'last' to sqlite instead?
last = dt
new += 1
p = Measurement(
dt=dt,
temp=temp,
pressure=pres,
humidity=hum,
dewpoint=dewp,
)
yield p
logger.debug(f'{path}: new {new}/{tot}')
# logger.info('total items: %d', len(merged))
# for k, v in merged.items():
# # TODO shit. quite a few of them have varying values... how is that freaking possible????
# # most of them are within 0.5 degree though... so just ignore?
# if isinstance(v, set) and len(v) > 1:
# print(k, v)
# for k, v in merged.items():
# yield Point(dt=k, temp=v) # meh?
def stats() -> Stats:
return stat(measurements)
def dataframe() -> DataFrameT:
"""
%matplotlib gtk
from my.bluemaestro import dataframe
dataframe().plot()
"""
df = as_dataframe(measurements(), schema=Measurement)
# todo not sure how it would handle mixed timezones??
# todo hmm, not sure about setting the index
return df.set_index('dt')
def fill_influxdb() -> None:
from my.core import influxdb
influxdb.fill(measurements(), measurement=__name__)
def check() -> None:
temps = list(measurements())
latest = temps[:-2]
prev = unwrap(latest[-2]).dt
last = unwrap(latest[-1]).dt
# todo stat should expose a dataclass?
# TODO ugh. might need to warn about points past 'now'??
# the default shouldn't allow points in the future...
#
# TODO also needs to be filtered out on processing, should be rejected on the basis of export date?
POINTS_STORED = 6000 # on device?
FREQ_SEC = 60
SECS_STORED = POINTS_STORED * FREQ_SEC
HOURS_STORED = POINTS_STORED / (60 * 60 / FREQ_SEC) # around 4 days
NOW = datetime.now()
assert NOW - last < timedelta(hours=HOURS_STORED / 2), f'old backup! {last}'
assert last - prev < timedelta(minutes=3), f'bad interval! {last - prev}'
single = (last - prev).seconds

View file

@ -1,95 +0,0 @@
#!/usr/bin/python3
"""
[[https://bluemaestro.com/products/product-details/bluetooth-environmental-monitor-and-logger][Bluemaestro]] temperature/humidity/pressure monitor
"""
# TODO eh, most of it belongs to DAL
import sqlite3
from datetime import datetime
from itertools import chain, islice
from pathlib import Path
from typing import Any, Dict, Iterable, NamedTuple, Set
from ..common import mcachew, LazyLogger, get_files
from my.config import bluemaestro as config
logger = LazyLogger('bluemaestro', level='debug')
def _get_exports():
return get_files(config.export_path, glob='*.db')
class Measurement(NamedTuple):
dt: datetime
temp: float
@mcachew(cache_path=config.cache_path)
def _iter_measurements(dbs) -> Iterable[Measurement]:
# I guess we can afford keeping them in sorted order
points: Set[Measurement] = set()
# TODO do some sanity check??
for f in dbs:
# err = f'{f}: mismatch: {v} vs {value}'
# if abs(v - value) > 0.4:
# logger.warning(err)
# # TODO mm. dunno how to mark errors properly..
# # raise AssertionError(err)
# else:
# pass
with sqlite3.connect(str(f)) as db:
datas = list(db.execute('select * from data'))
for _, tss, temp, hum, pres, dew in datas:
# TODO is that utc???
tss = tss.replace('Juli', 'Jul').replace('Aug.', 'Aug')
dt = datetime.strptime(tss, '%Y-%b-%d %H:%M')
p = Measurement(
dt=dt,
temp=temp,
# TODO use pressure and humidity as well
)
if p in points:
continue
points.add(p)
# TODO make properly iterative?
for p in sorted(points, key=lambda p: p.dt):
yield p
# logger.info('total items: %d', len(merged))
# TODO assert frequency?
# for k, v in merged.items():
# # TODO shit. quite a few of them have varying values... how is that freaking possible????
# # most of them are within 0.5 degree though... so just ignore?
# if isinstance(v, set) and len(v) > 1:
# print(k, v)
# for k, v in merged.items():
# yield Point(dt=k, temp=v) # meh?
# TODO does it even have to be a dict?
# @dictify(key=lambda p: p.dt)
def measurements(exports=_get_exports()):
yield from _iter_measurements(exports)
def dataframe():
"""
%matplotlib gtk
from my.bluemaestro import get_dataframe
get_dataframe().plot()
"""
import pandas as pd # type: ignore
return pd.DataFrame(p._asdict() for p in measurements()).set_index('dt')
def main():
ll = list(measurements(_get_exports()))
print(len(ll))
if __name__ == '__main__':
main()

View file

@ -1,29 +0,0 @@
#!/usr/bin/python3
import logging
from datetime import timedelta, datetime
from my.bluemaestro import measurements, logger
# TODO move this to backup checker?
def main():
temps = list(measurements())
latest = temps[:-2]
prev, _ = latest[-2]
last, _ = latest[-1]
POINTS_STORED = 6000
FREQ_SEC = 60
SECS_STORED = POINTS_STORED * FREQ_SEC
HOURS_STORED = POINTS_STORED / (60 * 60 / FREQ_SEC) # around 4 days
NOW = datetime.now()
assert NOW - last < timedelta(hours=HOURS_STORED / 2), f'old backup! {last}'
assert last - prev < timedelta(minutes=3), f'bad interval! {last - prev}'
single = (last - prev).seconds
if __name__ == '__main__':
main()

155
my/body/blood.py Executable file → Normal file
View file

@ -1,123 +1,134 @@
 """
-Blood tracking
+Blood tracking (manual org-mode entries)
 """
+from __future__ import annotations
+
+from collections.abc import Iterable
 from datetime import datetime
-from typing import Iterable, NamedTuple, Optional
-from itertools import chain
-
-import porg
-from ..common import listify
-from ..error import Res, echain
-from kython.org import parse_org_date
-from my.config import blood as config
-import pandas as pd  # type: ignore
+from typing import NamedTuple
+
+import orgparse
+import pandas as pd
+
+from my.config import blood as config  # type: ignore[attr-defined]
+
+from ..core.error import Res
+from ..core.orgmode import one_table, parse_org_datetime
 
 class Entry(NamedTuple):
     dt: datetime
-    ket: Optional[float]=None
-    glu: Optional[float]=None
-    vitd: Optional[float]=None
-    b12: Optional[float]=None
-    hdl: Optional[float]=None
-    ldl: Optional[float]=None
-    trig: Optional[float]=None
-    extra: Optional[str]=None
+    ketones      : float | None=None
+    glucose      : float | None=None
+    vitamin_d    : float | None=None
+    vitamin_b12  : float | None=None
+    hdl          : float | None=None
+    ldl          : float | None=None
+    triglycerides: float | None=None
+    source       : str | None=None
+    extra        : str | None=None
 
 Result = Res[Entry]
 
-class ParseError(Exception):
-    pass
-
-def try_float(s: str) -> Optional[float]:
+def try_float(s: str) -> float | None:
     l = s.split()
     if len(l) == 0:
         return None
+    # meh. this is to strip away HI/LO? Maybe need extract_float instead or something
     x = l[0].strip()
     if len(x) == 0:
         return None
     return float(x)
 
-def iter_gluc_keto_data() -> Iterable[Result]:
-    o = porg.Org.from_file(str(config.blood_log))
-    tbl = o.xpath('//table')
-    for l in tbl.lines:
-        kets = l['ket'].strip()
-        glus = l['glu'].strip()
-        extra = l['notes']
-        dt = parse_org_date(l['datetime'])
-        assert isinstance(dt, datetime)
-        ket = try_float(kets)
-        glu = try_float(glus)
-        yield Entry(
-            dt=dt,
-            ket=ket,
-            glu=glu,
-            extra=extra,
-        )
+def glucose_ketones_data() -> Iterable[Result]:
+    o = orgparse.load(config.blood_log)
+    [n] = [x for x in o if x.heading == 'glucose/ketones']
+    tbl = one_table(n)
+    # todo some sort of sql-like interface for org tables might be ideal?
+    for l in tbl.as_dicts:
+        kets = l['ket']
+        glus = l['glu']
+        extra = l['notes']
+        dt = parse_org_datetime(l['datetime'])
+        try:
+            assert isinstance(dt, datetime)
+            ket = try_float(kets)
+            glu = try_float(glus)
+        except Exception as e:
+            ex = RuntimeError(f'While parsing {l}')
+            ex.__cause__ = e
+            yield ex
+        else:
+            yield Entry(
+                dt=dt,
+                ketones=ket,
+                glucose=glu,
+                extra=extra,
+            )
 
-def iter_tests_data() -> Iterable[Result]:
-    o = porg.Org.from_file(str(config.blood_tests_log))
-    tbl = o.xpath('//table')
-    for d in tbl.lines:
-        try:
-            dt = parse_org_date(d['datetime'])
-            assert isinstance(dt, datetime)
-            # TODO rest
-            F = lambda n: try_float(d[n])
-            yield Entry(
-                dt=dt,
-                vitd=F('VD nm/L'),
-                b12 =F('B12 pm/L'),
-                hdl =F('HDL mm/L'),
-                ldl =F('LDL mm/L'),
-                trig=F('Trig mm/L'),
-                extra=d['misc'],
-            )
-        except Exception as e:
-            print(e)
-            yield echain(ParseError(str(d)), e)
+def blood_tests_data() -> Iterable[Result]:
+    o = orgparse.load(config.blood_log)
+    [n] = [x for x in o if x.heading == 'blood tests']
+    tbl = one_table(n)
+    for d in tbl.as_dicts:
+        try:
+            dt = parse_org_datetime(d['datetime'])
+            assert isinstance(dt, datetime), dt
+            F = lambda n: try_float(d[n])
+            yield Entry(
+                dt=dt,
+                vitamin_d    =F('VD nm/L'),
+                vitamin_b12  =F('B12 pm/L'),
+                hdl          =F('HDL mm/L'),
+                ldl          =F('LDL mm/L'),
+                triglycerides=F('Trig mm/L'),
+                source       =d['source'],
+                extra        =d['notes'],
+            )
+        except Exception as e:
+            ex = RuntimeError(f'While parsing {d}')
+            ex.__cause__ = e
+            yield ex
 
-def data():
-    datas = list(chain(iter_gluc_keto_data(), iter_tests_data()))
-    return list(sorted(datas, key=lambda d: getattr(d, 'dt', datetime.min)))
+def data() -> Iterable[Result]:
+    from itertools import chain
+    from ..core.error import sort_res_by
+    datas = chain(glucose_ketones_data(), blood_tests_data())
+    return sort_res_by(datas, key=lambda e: e.dt)
 
-@listify(wrapper=pd.DataFrame)
-def dataframe():
-    for d in data():
-        if isinstance(d, Exception):
-            yield {'error': str(d)}
-        else:
-            yield d._asdict()
+def dataframe() -> pd.DataFrame:
+    rows = []
+    for x in data():
+        if isinstance(x, Exception):
+            # todo use some core helper? this is a pretty common operation
+            d = {'error': str(x)}
+        else:
+            d = x._asdict()
+        rows.append(d)
+    return pd.DataFrame(rows)
+
+def stats():
+    from ..core import stat
+    return stat(data)
 
 def test():
     print(dataframe())
     assert len(dataframe()) > 10
+
+def main():
+    print(data())
+
+if __name__ == '__main__':
+    main()
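A quick usage sketch for the new interface (assumes my.config.blood points at the org-mode log):

from my.body.blood import data, dataframe

for x in data():  # Entry items; parse failures are yielded as exceptions rather than raised
    print(x)
print(dataframe().head())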

17
my/body/exercise/all.py Normal file
View file

@ -0,0 +1,17 @@
'''
Combined exercise data
'''
from ...core.pandas import DataFrameT, check_dataframe
@check_dataframe
def dataframe() -> DataFrameT:
# this should be somehow more flexible...
import pandas as pd
from ...endomondo import dataframe as EDF
from ...runnerup import dataframe as RDF
return pd.concat([
EDF(),
RDF(),
])

View file

@ -0,0 +1,43 @@
'''
Cardio data, filtered from various data sources
'''
from ...core.pandas import DataFrameT, check_dataframe
CARDIO = {
'Running',
'Running, treadmill',
'Cross training',
'Walking',
'Skating',
'Spinning',
'Skiing',
'Table tennis',
'Rope jumping',
}
# todo if it has HR data, take it into the account??
NOT_CARDIO = {
'Other',
}
@check_dataframe
def dataframe() -> DataFrameT:
assert len(CARDIO.intersection(NOT_CARDIO)) == 0, (CARDIO, NOT_CARDIO)
from .all import dataframe as DF
df = DF()
# not sure...
# df = df[df['heart_rate_avg'].notna()]
is_cardio = df['sport'].isin(CARDIO)
not_cardio = df['sport'].isin(NOT_CARDIO)
neither = ~is_cardio & ~not_cardio
# if neither -- count, but warn? or show error?
# todo error about the rest??
# todo append errors?
df.loc[neither, 'error'] = 'Unexpected exercise type, please mark as cardio or non-cardio'
df = df[is_cardio | neither]
return df

View file

@ -0,0 +1,190 @@
'''
My cross trainer exercise data, arbitrated from different sources (mainly, Endomondo and manual text notes)
This is probably too specific to my needs, so later I will move it away to a personal 'layer'.
For now it's worth keeping it here as an example and perhaps utility functions might be useful for other HPI modules.
'''
from __future__ import annotations
from datetime import datetime, timedelta
import pytz
from my.config import exercise as config
from ...core.orgmode import Table, TypedTable, collect, parse_org_datetime
from ...core.pandas import DataFrameT
from ...core.pandas import check_dataframe as cdf
# FIXME how to attach it properly?
tz = pytz.timezone('Europe/London')
def tzify(d: datetime) -> datetime:
assert d.tzinfo is None, d
return tz.localize(d)
# todo predataframe?? entries??
def cross_trainer_data():
# FIXME some manual entries in python
# I guess just convert them to org
import orgparse
# todo should use all org notes and just query from them?
wlog = orgparse.load(config.workout_log)
[table] = collect(
wlog,
lambda n: [] if n.heading != 'Cross training' else [x for x in n.body_rich if isinstance(x, Table)]
)
cross_table = TypedTable(table)
def maybe(f):
def parse(s):
if len(s) == 0:
return None
return f(s)
return parse
def parse_mm_ss(x: str) -> timedelta:
mins, secs = x.split(':')  # e.g. '17:30' -> 17 minutes 30 seconds
return timedelta(seconds=int(mins) * 60 + int(secs))
# todo eh. not sure if there is a way of getting around writing code...
# I guess would be nice to have a means of specifying type in the column? maybe multirow column names??
# need to look up org-mode standard..
mappers = {
'duration': lambda s: parse_mm_ss(s),
'date' : lambda s: tzify(parse_org_datetime(s)),
'comment' : str,
}
for row in cross_table.as_dicts:
# todo make more defensive, fallback on nan for individual fields??
try:
d = {}
for k, v in row.items():
# todo have something smarter... e.g. allow pandas to infer the type??
mapper = mappers.get(k, maybe(float))
d[k] = mapper(v) # type: ignore[operator]
yield d
except Exception as e:
# todo add parsing context
yield {'error': str(e)}
# todo hmm, converting an org table directly to pandas kinda makes sense?
# could have a '.dataframe' method in orgparse, optional dependency
@cdf
def cross_trainer_manual_dataframe() -> DataFrameT:
'''
Only manual org-mode entries
'''
import pandas as pd
df = pd.DataFrame(cross_trainer_data())
return df
# this should be enough?..
_DELTA = timedelta(hours=10)
# todo check error handling by introducing typos (e.g. especially dates) in org-mode
@cdf
def dataframe() -> DataFrameT:
'''
Attaches manually logged data (which Endomondo can't capture) and attaches it to Endomondo
'''
import pandas as pd
from ...endomondo import dataframe as EDF
edf = EDF()
edf = edf[edf['sport'].str.contains('Cross training')]
mdf = cross_trainer_manual_dataframe()
# TODO shit. need to always remember to split errors???
# on the other hand, dfs are always untyped. so it's not too bad??
# now for each manual entry, find a 'close enough' endomondo entry
# ideally it's a 1-1 (or 0-1) relationship, but there might be errors
rows = []
idxs = [] # type: ignore[var-annotated]
NO_ENDOMONDO = 'no endomondo matches'
for _i, row in mdf.iterrows():
rd = row.to_dict()
mdate = row['date']
if pd.isna(mdate):
# todo error handling got to be easier. seriously, mypy friendly dataframes would be amazing
idxs.append(None)
rows.append(rd) # presumably has an error set
continue
idx: int | None
close = edf[edf['start_time'].apply(lambda t: pd_date_diff(t, mdate)).abs() < _DELTA]
if len(close) == 0:
idx = None
d = {
**rd,
'error': NO_ENDOMONDO,
}
elif len(close) > 1:
idx = None
d = {
**rd,
'error': f'one manual, many endomondo: {close}',
}
else:
idx = close.index[0]
d = rd
if idx in idxs:
# todo might be a good idea to remove the original match as well?
idx = None
d = {
**rd,
'error': 'one endomondo, many manual',
}
idxs.append(idx)
rows.append(d)
mdf = pd.DataFrame(rows, index=idxs)
# todo careful about 'how'? we need it to preserve the errors
# maybe pd.merge is better suited for this??
df = edf.join(mdf, how='outer', rsuffix='_manual')
# todo reindex? so we don't have Nan leftovers
# todo set date anyway? maybe just squeeze into the index??
noendo = df['error'] == NO_ENDOMONDO
# meh. otherwise the column type ends up object
tz = df[noendo]['start_time'].dtype.tz
df.loc[noendo, 'start_time' ] = df[noendo]['date' ].dt.tz_convert(tz)
df.loc[noendo, 'duration' ] = df[noendo]['duration_manual']
df.loc[noendo, 'heart_rate_avg'] = df[noendo]['hr_avg' ]
# todo set sport?? set source?
return df
# TODO arbitrate kcal, duration, avg hr
# compare power and hr? add 'quality' function??
# TODO wtf?? where is speed coming from??
from ...core import Stats, stat
def stats() -> Stats:
return stat(cross_trainer_data)
def compare_manual() -> None:
df = dataframe()
df = df.set_index('start_time')
df = df[[
'kcal' , 'kcal_manual',
'duration', 'duration_manual',
]].dropna()
print(df.to_string())
def pd_date_diff(a, b) -> timedelta:
# ugh. pandas complains when we subtract timestamps in different timezones
assert a.tzinfo is not None, a
assert b.tzinfo is not None, b
return a.to_pydatetime() - b.to_pydatetime()

41
my/body/sleep/common.py Normal file
View file

@ -0,0 +1,41 @@
from ...core import Stats, stat
from ...core.pandas import DataFrameT
from ...core.pandas import check_dataframe as cdf
class Combine:
def __init__(self, modules) -> None:
self.modules = modules
@cdf
def dataframe(self, *, with_temperature: bool=True) -> DataFrameT:
import pandas as pd
# todo include 'source'?
df = pd.concat([m.dataframe() for m in self.modules])
if with_temperature:
from ... import bluemaestro as BM
bdf = BM.dataframe()
temp = bdf['temp']
# sort index and drop nans, otherwise indexing with [start: end] gonna complain
temp = pd.Series(
temp.values,
index=pd.to_datetime(temp.index, utc=True)
).sort_index()
temp = temp.loc[temp.index.dropna()]
def calc_avg_temperature(row):
start = row['sleep_start']
end = row['sleep_end']
if pd.isna(start) or pd.isna(end):
return None
# on no temp data, returns nan, ok
return temp[start: end].mean()
df['avg_temp'] = df.apply(calc_avg_temperature, axis=1)
return df
def stats(self) -> Stats:
return stat(self.dataframe)

10
my/body/sleep/main.py Normal file
View file

@ -0,0 +1,10 @@
from ... import emfit, jawbone
from .common import Combine
_combined = Combine([
jawbone,
emfit,
])
dataframe = _combined.dataframe
stats = _combined.stats

View file

@ -2,20 +2,29 @@
 Weight data (manually logged)
 '''
+from collections.abc import Iterator
+from dataclasses import dataclass
 from datetime import datetime
-from typing import NamedTuple, Iterator
+from typing import Any
 
-from ..common import LazyLogger
-from ..error import Res
-from ..notes import orgmode
-from my.config import weight as config
+from my import orgmode
+from my.core import make_logger
+from my.core.error import Res, extract_error_datetime, set_error_datetime
 
-log = LazyLogger('my.body.weight')
+config = Any
+
+def make_config() -> config:
+    from my.config import weight as user_config  # type: ignore[attr-defined]
+    return user_config()
+
+log = make_logger(__name__)
 
-class Entry(NamedTuple):
+@dataclass
+class Entry:
     dt: datetime
     value: float
     # TODO comment??
@ -24,26 +33,31 @@ class Entry(NamedTuple):
 Result = Res[Entry]
 
-# TODO cachew?
 def from_orgmode() -> Iterator[Result]:
+    cfg = make_config()
     orgs = orgmode.query()
-    for o in orgs.query_all(lambda o: o.with_tag('weight')):
+    for o in orgmode.query().all():
+        if 'weight' not in o.tags:
+            continue
         try:
-            # TODO ?? Result type?
+            # TODO can it throw? not sure
            created = o.created
-            heading = o.heading
+            assert created is not None
         except Exception as e:
            log.exception(e)
            yield e
            continue
        try:
-            w = float(heading)
-        except ValueError as e:
+            w = float(o.heading)
+        except Exception as e:
+            set_error_datetime(e, dt=created)
            log.exception(e)
            yield e
            continue
-        # TODO not sure if it's really necessary..
-        created = config.default_timezone.localize(created)
-        assert created is not None  # ??? somehow mypy wasn't happy?
+        # FIXME use timezone provider
+        created = cfg.default_timezone.localize(created)
        yield Entry(
            dt=created,
            value=w,
@ -51,21 +65,35 @@ def from_orgmode() -> Iterator[Result]:
        )
 
-def dataframe():
-    import pandas as pd  # type: ignore
-    entries = from_orgmode()
+def make_dataframe(data: Iterator[Result]):
+    import pandas as pd
     def it():
-        for e in from_orgmode():
+        for e in data:
            if isinstance(e, Exception):
+                dt = extract_error_datetime(e)
                yield {
+                    'dt': dt,
                    'error': str(e),
                }
            else:
                yield {
                    'dt': e.dt,
                    'weight': e.value,
                }
     df = pd.DataFrame(it())
-    df.set_index('dt', inplace=True)
-    # TODO not sure about UTC??
+    df = df.set_index('dt')
     df.index = pd.to_datetime(df.index, utc=True)
     return df
+
+def dataframe():
+    entries = from_orgmode()
+    return make_dataframe(entries)
+
+# TODO move to a submodule? e.g. my.body.weight.orgmode?
+# so there could be more sources
+# not sure about my.body thing though

View file

@ -1,47 +1,6 @@
-"""
-Kobo e-ink reader: annotations and reading stats
-"""
-from .. import init
-
-from typing import Callable, Union, List
-
-from my.config import kobo as config
-from my.config.repos.kobuddy.src.kobuddy import *
-# hmm, explicit imports make pylint a bit happier..
-from my.config.repos.kobuddy.src.kobuddy import Highlight, set_databases, get_highlights
-
-set_databases(config.export_dir)
-
-# TODO maybe type over T?
-_Predicate = Callable[[str], bool]
-Predicatish = Union[str, _Predicate]
-
-def from_predicatish(p: Predicatish) -> _Predicate:
-    if isinstance(p, str):
-        def ff(s):
-            return s == p
-        return ff
-    else:
-        return p
-
-def by_annotation(predicatish: Predicatish, **kwargs) -> List[Highlight]:
-    pred = from_predicatish(predicatish)
-    res: List[Highlight] = []
-    for h in get_highlights(**kwargs):
-        if pred(h.annotation):
-            res.append(h)
-    return res
-
-def get_todos():
-    def with_todo(ann):
-        if ann is None:
-            ann = ''
-        return 'todo' in ann.lower().split()
-    return by_annotation(with_todo)
-
-def test_todos():
-    todos = get_todos()
-    assert len(todos) > 3
+from my.core import warnings
+
+warnings.high('my.books.kobo is deprecated! Please use my.kobo instead!')
+
+from my.core.util import __NOT_HPI_MODULE__
+from my.kobo import *

View file

@ -0,0 +1,54 @@
"""
Parses active browser history by backing it up with [[http://github.com/purarue/sqlite_backup][sqlite_backup]]
"""
REQUIRES = ["browserexport", "sqlite_backup"]
from dataclasses import dataclass
from my.config import browser as user_config
from my.core import Paths
@dataclass
class config(user_config.active_browser):
# paths to sqlite database files which you use actively
# to read from. For example:
# from browserexport.browsers.all import Firefox
# export_path = Firefox.locate_database()
export_path: Paths
from collections.abc import Iterator, Sequence
from pathlib import Path
from browserexport.merge import Visit, read_visits
from sqlite_backup import sqlite_backup
from my.core import Stats, get_files, make_logger
logger = make_logger(__name__)
from .common import _patch_browserexport_logs
_patch_browserexport_logs(logger.level)
def inputs() -> Sequence[Path]:
return get_files(config.export_path)
def history() -> Iterator[Visit]:
for ad in inputs():
conn = sqlite_backup(ad)
assert conn is not None
try:
yield from read_visits(conn)
finally:
conn.close()
def stats() -> Stats:
from my.core import stat
return {**stat(history)}

35
my/browser/all.py Normal file
View file

@ -0,0 +1,35 @@
from collections.abc import Iterator
from browserexport.merge import Visit, merge_visits
from my.core import Stats
from my.core.source import import_source
src_export = import_source(module_name="my.browser.export")
src_active = import_source(module_name="my.browser.active_browser")
@src_export
def _visits_export() -> Iterator[Visit]:
from . import export
return export.history()
@src_active
def _visits_active() -> Iterator[Visit]:
from . import active_browser
return active_browser.history()
# NOTE: you can comment out the sources you don't need
def history() -> Iterator[Visit]:
yield from merge_visits([
_visits_active(),
_visits_export(),
])
def stats() -> Stats:
from my.core import stat
return {**stat(history)}
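If you had a third visits source, the import_source pattern extends the same way (sketch only; my.browser.somebrowser and its history() are hypothetical):

src_other = import_source(module_name="my.browser.somebrowser")  # hypothetical module

@src_other
def _visits_other() -> Iterator[Visit]:
    from . import somebrowser  # hypothetical
    return somebrowser.history()

# ...and _visits_other() would be added to the merge_visits list in history()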

8
my/browser/common.py Normal file
View file

@ -0,0 +1,8 @@
from my.core.util import __NOT_HPI_MODULE__
def _patch_browserexport_logs(level: int):
# grab the computed level (respects LOGGING_LEVEL_ prefixes) and set it on the browserexport logger
from browserexport.log import setup as setup_browserexport_logger
setup_browserexport_logger(level)

48
my/browser/export.py Normal file
View file

@ -0,0 +1,48 @@
"""
Parses browser history using [[http://github.com/purarue/browserexport][browserexport]]
"""
REQUIRES = ["browserexport"]
from collections.abc import Iterator, Sequence
from dataclasses import dataclass
from pathlib import Path
from browserexport.merge import Visit, read_and_merge
from my.core import (
Paths,
Stats,
get_files,
make_logger,
stat,
)
from my.core.cachew import mcachew
from .common import _patch_browserexport_logs
import my.config # isort: skip
@dataclass
class config(my.config.browser.export):
# path[s]/glob to your backed up browser history sqlite files
export_path: Paths
logger = make_logger(__name__)
_patch_browserexport_logs(logger.level)
# all of my backed up databases
def inputs() -> Sequence[Path]:
return get_files(config.export_path)
@mcachew(depends_on=inputs, logger=logger)
def history() -> Iterator[Visit]:
yield from read_and_merge(inputs())
def stats() -> Stats:
return {**stat(history)}

157
my/bumble/android.py Normal file
View file

@ -0,0 +1,157 @@
"""
Bumble data from Android app database (in =/data/data/com.bumble.app/databases/ChatComDatabase=)
"""
from __future__ import annotations
from collections.abc import Iterator, Sequence
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from more_itertools import unique_everseen
from my.core import Paths, get_files
from my.config import bumble as user_config # isort: skip
@dataclass
class config(user_config.android):
# paths[s]/glob to the exported sqlite databases
export_path: Paths
def inputs() -> Sequence[Path]:
return get_files(config.export_path)
@dataclass(unsafe_hash=True)
class Person:
user_id: str
user_name: str
# todo not sure about order of fields...
@dataclass
class _BaseMessage:
id: str
created: datetime
is_incoming: bool
text: str
@dataclass(unsafe_hash=True)
class _Message(_BaseMessage):
conversation_id: str
reply_to_id: str | None
@dataclass(unsafe_hash=True)
class Message(_BaseMessage):
person: Person
reply_to: Message | None
import json
import sqlite3
from typing import Union
from my.core.compat import assert_never
from ..core import Res
from ..core.sqlite import select, sqlite_connect_immutable
EntitiesRes = Res[Union[Person, _Message]]
def _entities() -> Iterator[EntitiesRes]:
for db_file in inputs():
with sqlite_connect_immutable(db_file) as db:
yield from _handle_db(db)
def _handle_db(db: sqlite3.Connection) -> Iterator[EntitiesRes]:
# todo hmm not sure
# on the one hand kinda nice to use dataset..
# on the other, it's somewhat of a complication, and
# would be nice to have something type-directed for sql queries though
# e.g. with typeddict or something, so the number of parameter to the sql query matches?
for (user_id, user_name) in select(
('user_id', 'user_name'),
'FROM conversation_info',
db=db,
):
yield Person(
user_id=user_id,
user_name=user_name,
)
# note: has sender_name, but it's always None
for ( id, conversation_id , created , is_incoming , payload_type , payload , reply_to_id) in select(
('id', 'conversation_id', 'created_timestamp', 'is_incoming', 'payload_type', 'payload', 'reply_to_id'),
'FROM message ORDER BY created_timestamp',
db=db
):
try:
key = {'TEXT': 'text', 'QUESTION_GAME': 'text', 'IMAGE': 'url', 'GIF': 'url', 'AUDIO': 'url', 'VIDEO': 'url'}[payload_type]
text = json.loads(payload)[key]
yield _Message(
id=id,
# TODO not sure if utc??
created=datetime.fromtimestamp(created / 1000),
is_incoming=bool(is_incoming),
text=text,
conversation_id=conversation_id,
reply_to_id=reply_to_id,
)
except Exception as e:
yield e
def _key(r: EntitiesRes):
if isinstance(r, _Message):
if '/hidden?' in r.text:
# ugh. seems that image URLs change all the time in the db?
# can't access them without login anyway
# so use a different key for such messages
# todo maybe normalize text instead? since it's gonna always trigger diffs down the line
return (r.id, r.created)
return r
_UNKNOWN_PERSON = "UNKNOWN_PERSON"
def messages() -> Iterator[Res[Message]]:
id2person: dict[str, Person] = {}
id2msg: dict[str, Message] = {}
for x in unique_everseen(_entities(), key=_key):
if isinstance(x, Exception):
yield x
continue
if isinstance(x, Person):
id2person[x.user_id] = x
continue
if isinstance(x, _Message):
reply_to_id = x.reply_to_id
# hmm seems that sometimes there are messages with no corresponding conversation_info?
# possibly if user never clicked on conversation before..
person = id2person.get(x.conversation_id)
if person is None:
person = Person(user_id=x.conversation_id, user_name=_UNKNOWN_PERSON)
try:
reply_to = None if reply_to_id is None else id2msg[reply_to_id]
except Exception as e:
yield e
continue
m = Message(
id=x.id,
created=x.created,
is_incoming=x.is_incoming,
text=x.text,
person=person,
reply_to=reply_to,
)
id2msg[m.id] = m
yield m
continue
assert_never(x)
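A minimal consumption sketch (with the Res pattern, parse errors are yielded inline, so filter them out explicitly):

from my.bumble.android import messages

ok = [m for m in messages() if not isinstance(m, Exception)]
print(f'parsed {len(ok)} messages')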

View file

@ -1,91 +1,60 @@
""" """
Provides data on days off work (based on public holidays + manual inputs) Holidays and days off work
""" """
REQUIRES = [
'workalendar', # library to determine public holidays
]
from functools import lru_cache
from datetime import date, datetime, timedelta from datetime import date, datetime, timedelta
import re from functools import lru_cache
from typing import Tuple, Iterator, List, Union from typing import Union
from my.core import Stats
from my.core.time import zone_to_countrycode
from my.config.holidays_data import HOLIDAYS_DATA @lru_cache(1)
def _calendar():
from workalendar.registry import registry # type: ignore
# todo switch to using time.tz.main once _get_tz stabilizes?
from ..time.tz import via_location as LTZ
# TODO would be nice to do it dynamically depending on the past timezones...
tz = LTZ.get_tz(datetime.now())
assert tz is not None
zone = tz.zone; assert zone is not None
code = zone_to_countrycode(zone)
Cal = registry.get_calendars()[code]
return Cal()
# pip3 install workalendar # todo move to common?
from workalendar.europe import UnitedKingdom # type: ignore DateIsh = Union[datetime, date, str]
cal = UnitedKingdom() # TODO FIXME specify in config def as_date(dd: DateIsh) -> date:
# TODO that should depend on country/'location' of residence I suppose?
Dateish = Union[datetime, date, str]
def as_date(dd: Dateish) -> date:
if isinstance(dd, datetime): if isinstance(dd, datetime):
return dd.date() return dd.date()
elif isinstance(dd, date): elif isinstance(dd, date):
return dd return dd
else: else:
# todo parse isoformat??
return as_date(datetime.strptime(dd, '%Y%m%d')) return as_date(datetime.strptime(dd, '%Y%m%d'))
@lru_cache(1) def is_holiday(d: DateIsh) -> bool:
def get_days_off_work() -> List[date]:
return list(iter_days_off_work())
def is_day_off_work(d: date) -> bool:
return d in get_days_off_work()
def is_working_day(d: Dateish) -> bool:
day = as_date(d) day = as_date(d)
if not cal.is_working_day(day): return not _calendar().is_working_day(day)
# public holiday -- def holiday
return False
# otherwise rely on work data
return not is_day_off_work(day)
def is_holiday(d: Dateish) -> bool: def is_workday(d: DateIsh) -> bool:
return not(is_working_day(d)) return not is_holiday(d)
def _iter_work_data() -> Iterator[Tuple[date, int]]: def stats() -> Stats:
emitted = 0 # meh, but not sure what would be a better test?
for x in HOLIDAYS_DATA.splitlines(): res = {}
m = re.search(r'(\d\d/\d\d/\d\d\d\d)(.*)-(\d+.\d+) days \d+.\d+ days', x) year = datetime.now().year
if m is None: jan1 = date(year=year, month=1, day=1)
continue for x in range(-7, 20):
(ds, cmnt, dayss) = m.groups() d = jan1 + timedelta(days=x)
if 'carry over' in cmnt: h = is_holiday(d)
continue res[d.isoformat()] = 'holiday' if h else 'workday'
return res
d = datetime.strptime(ds, '%d/%m/%Y').date()
dd, u = dayss.split('.')
assert u == '00' # TODO meh
yield d, int(dd)
emitted += 1
assert emitted > 5 # arbitrary, just a sanity check.. (TODO move to tests?)
def iter_days_off_work() -> Iterator[date]:
for d, span in _iter_work_data():
dd = d
while span > 0:
# only count it if it wasnt' a public holiday/weekend already
if cal.is_working_day(dd):
yield dd
span -= 1
dd += timedelta(days=1)
def test():
assert is_holiday('20190101')
assert not is_holiday('20180601')
if __name__ == '__main__':
for d in iter_days_off_work():
print(d, ' | ', d.strftime('%d %b'))
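A rough usage sketch of the rewritten API (assuming this module is importable as my.calendar.holidays, and that my.core.time can map your timezone to a workalendar country code):

from my.calendar.holidays import is_holiday, is_workday, stats

print(is_holiday('20190101'))  # as_date parses strings as %Y%m%d
print(is_workday('20190102'))
print(stats())  # labels the days around Jan 1 as 'holiday'/'workday'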

View file

@@ -1,19 +1,7 @@
"""
A helper to allow configuring the modules dynamically.
Usage:
from my.cfg import config
After that, you can set config attributes:
from types import SimpleNamespace
config.twitter = SimpleNamespace(
export_path='/path/to/twitter/exports',
)
"""
# TODO later, If I have config stubs that might be unnecessary too..
from . import init
import my.config as config import my.config as config
from .core import __NOT_HPI_MODULE__
from .core import warnings as W
# still used in Promnesia, maybe in dashboard?
W.high("DEPRECATED! Please import my.config directly instead.")

78
my/codeforces.py Normal file
View file

@@ -0,0 +1,78 @@
import json
from collections.abc import Iterator, Sequence
from dataclasses import dataclass
from datetime import datetime, timezone
from functools import cached_property
from pathlib import Path
from my.config import codeforces as config # type: ignore[attr-defined]
from my.core import Res, datetime_aware, get_files
def inputs() -> Sequence[Path]:
return get_files(config.export_path)
ContestId = int
@dataclass
class Contest:
contest_id: ContestId
when: datetime_aware
name: str
@dataclass
class Competition:
contest: Contest
old_rating: int
new_rating: int
@cached_property
def when(self) -> datetime_aware:
return self.contest.when
# todo not sure if parser is the best name? hmm
class Parser:
def __init__(self, *, inputs: Sequence[Path]) -> None:
self.inputs = inputs
self.contests: dict[ContestId, Contest] = {}
def _parse_allcontests(self, p: Path) -> Iterator[Contest]:
j = json.loads(p.read_text())
for c in j['result']:
yield Contest(
contest_id=c['id'],
when=datetime.fromtimestamp(c['startTimeSeconds'], tz=timezone.utc),
name=c['name'],
)
def _parse_competitions(self, p: Path) -> Iterator[Competition]:
j = json.loads(p.read_text())
for c in j['result']:
contest_id = c['contestId']
contest = self.contests[contest_id]
yield Competition(
contest=contest,
old_rating=c['oldRating'],
new_rating=c['newRating'],
)
def parse(self) -> Iterator[Res[Competition]]:
for path in inputs():
if 'allcontests' in path.name:
# these contain information about all CF contests along with useful metadata
for contest in self._parse_allcontests(path):
# TODO some method to assert on mismatch if it exists? not sure
self.contests[contest.contest_id] = contest
elif 'codeforces' in path.name:
# these contain only contests the user participated in
yield from self._parse_competitions(path)
else:
raise RuntimeError(f"shouldn't happen: {path.name}")
def data() -> Iterator[Res[Competition]]:
return Parser(inputs=inputs()).parse()
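A quick consumption sketch for the new module: data() yields Res[Competition], so errors arrive as values rather than being raised:

from my.codeforces import data

for item in data():
    if isinstance(item, Exception):
        print('error:', item)
        continue
    print(item.when, item.contest.name, item.old_rating, '->', item.new_rating)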

View file

@@ -1,116 +0,0 @@
#!/usr/bin/env python3
from .. import init
from my.config import codeforces as config
from datetime import datetime
from typing import NamedTuple
from pathlib import Path
import json
from typing import Dict, Iterator, Any
from ..common import cproperty, get_files
from ..error import Res, unwrap
from kython import fget
from kython.konsume import zoom, ignore, wrap
# TODO remove
from kython.kdatetime import as_utc
Cid = int
class Contest(NamedTuple):
cid: Cid
when: datetime
@classmethod
def make(cls, j) -> 'Contest':
return cls(
cid=j['id'],
when=as_utc(j['startTimeSeconds']),
)
Cmap = Dict[Cid, Contest]
def get_contests() -> Cmap:
last = max(get_files(config.export_path, 'allcontests*.json'))
j = json.loads(last.read_text())
d = {}
for c in j['result']:
cc = Contest.make(c)
d[cc.cid] = cc
return d
class Competition(NamedTuple):
contest_id: Cid
contest: str
cmap: Cmap
@cproperty
def uid(self) -> Cid:
return self.contest_id
def __hash__(self):
return hash(self.contest_id)
@cproperty
def when(self) -> datetime:
return self.cmap[self.uid].when
@cproperty
def summary(self) -> str:
return f'participated in {self.contest}' # TODO
@classmethod
def make(cls, cmap, json) -> Iterator[Res['Competition']]:
# TODO try here??
contest_id = json['contestId'].zoom().value
contest = json['contestName'].zoom().value
yield cls(
contest_id=contest_id,
contest=contest,
cmap=cmap,
)
# TODO ytry???
ignore(json, 'rank', 'oldRating', 'newRating')
def iter_data() -> Iterator[Res[Competition]]:
cmap = get_contests()
last = max(get_files(config.export_path, 'codeforces*.json'))
with wrap(json.loads(last.read_text())) as j:
j['status'].ignore()
res = j['result'].zoom()
for c in list(res): # TODO maybe we want 'iter' method??
ignore(c, 'handle', 'ratingUpdateTimeSeconds')
yield from Competition.make(cmap=cmap, json=c)
c.consume()
# TODO maybe if they are all empty, no need to consume??
def get_data():
return list(sorted(iter_data(), key=fget(Competition.when)))
def test():
assert len(get_data()) > 10
def main():
for d in iter_data():
try:
d = unwrap(d)
except Exception as e:
print(f'ERROR! {d}')
else:
print(f'{d.when}: {d.summary}')
if __name__ == '__main__':
main()

View file

@@ -1,62 +1,88 @@
""" """
Git commits data: crawls filesystem Git commits data for repositories on your filesystem
""" """
from pathlib import Path from __future__ import annotations
REQUIRES = [
'gitpython',
]
import shutil
from collections.abc import Iterator, Sequence
from dataclasses import dataclass, field
from datetime import datetime, timezone from datetime import datetime, timezone
from typing import List, NamedTuple, Optional, Dict, Any, Iterator, Set from pathlib import Path
from typing import Optional, cast
from ..common import PathIsh, LazyLogger, mcachew from my.core import LazyLogger, PathIsh, make_config
from my.config import commits as config from my.core.cachew import cache_dir, mcachew
from my.core.warnings import high
# pip3 install gitpython from my.config import commits as user_config # isort: skip
import git # type: ignore
from git.repo.fun import is_git_dir, find_worktree_git_dir # type: ignore
log = LazyLogger('my.commits', level='info') @dataclass
class commits_cfg(user_config):
roots: Sequence[PathIsh] = field(default_factory=list)
emails: Sequence[str] | None = None
names: Sequence[str] | None = None
_things = { # experiment to make it lazy?
*config.emails, # would be nice to have a nicer syntax for it... maybe make_config could return a 'lazy' object
*config.names, def config() -> commits_cfg:
} res = make_config(commits_cfg)
if res.emails is None and res.names is None:
# todo error policy? throw/warn/ignore
high("Set either 'emails' or 'names', otherwise you'll get no commits")
return res
##########################
import git
from git.repo.fun import is_git_dir
log = LazyLogger(__name__, level='info')
def by_me(c) -> bool: def by_me(c: git.objects.commit.Commit) -> bool:
actor = c.author actor = c.author
if actor.email in config.emails: if actor.email in (config().emails or ()):
return True return True
if actor.name in config.names: if actor.name in (config().names or ()):
return True return True
aa = f"{actor.email} {actor.name}"
for thing in _things:
if thing in aa:
# TODO this is probably useless
raise RuntimeError("WARNING!!!", actor, c, c.repo)
return False return False
class Commit(NamedTuple): @dataclass
commited_dt: datetime class Commit:
committed_dt: datetime
authored_dt: datetime authored_dt: datetime
message: str message: str
repo: str # TODO put canonical name here straightaway?? repo: str # TODO put canonical name here straight away??
sha: str sha: str
ref: Optional[str]=None ref: Optional[str] = None
# TODO filter so they are authored by me # TODO filter so they are authored by me
@property @property
def dt(self) -> datetime: def dt(self) -> datetime:
return self.commited_dt return self.committed_dt
# for backwards compatibility, was misspelled previously
@property
def commited_dt(self) -> datetime:
high("DEPRECATED! Please replace 'commited_dt' with 'committed_dt' (two 't's instead of one)")
return self.committed_dt
# TODO not sure, maybe a better idea to move it to timeline? # TODO not sure, maybe a better idea to move it to timeline?
def fix_datetime(dt) -> datetime: def fix_datetime(dt: datetime) -> datetime:
# git module got it's own tzinfo object.. and it's pretty weird # git module got it's own tzinfo object.. and it's pretty weird
tz = dt.tzinfo tz = dt.tzinfo
assert tz._name == 'fixed' assert tz is not None, dt
offset = tz._offset assert getattr(tz, '_name') == 'fixed'
offset = getattr(tz, '_offset')
ntz = timezone(offset) ntz = timezone(offset)
return dt.replace(tzinfo=ntz) return dt.replace(tzinfo=ntz)
@ -69,7 +95,7 @@ def _git_root(git_dir: PathIsh) -> Path:
return gd # must be bare return gd # must be bare
def _repo_commits_aux(gr: git.Repo, rev: str, emitted: Set[str]) -> Iterator[Commit]: def _repo_commits_aux(gr: git.Repo, rev: str, emitted: set[str]) -> Iterator[Commit]:
# without path might not handle pull heads properly # without path might not handle pull heads properly
for c in gr.iter_commits(rev=rev): for c in gr.iter_commits(rev=rev):
if not by_me(c): if not by_me(c):
@ -79,12 +105,15 @@ def _repo_commits_aux(gr: git.Repo, rev: str, emitted: Set[str]) -> Iterator[Com
continue continue
emitted.add(sha) emitted.add(sha)
repo = str(_git_root(gr.git_dir)) # todo figure out how to handle Union[str, PathLike[Any]].. should it be part of PathIsh?
repo = str(_git_root(gr.git_dir)) # type: ignore[arg-type]
yield Commit( yield Commit(
commited_dt=fix_datetime(c.committed_datetime), committed_dt=fix_datetime(c.committed_datetime),
authored_dt=fix_datetime(c.authored_datetime), authored_dt=fix_datetime(c.authored_datetime),
message=c.message.strip(), # hmm no idea why is it typed with Union[str, bytes]??
# https://github.com/gitpython-developers/GitPython/blob/1746b971387eccfc6fb4e34d3c334079bbb14b2e/git/objects/commit.py#L214
message=cast(str, c.message).strip(),
repo=repo, repo=repo,
sha=sha, sha=sha,
ref=rev, ref=rev,
@ -93,7 +122,7 @@ def _repo_commits_aux(gr: git.Repo, rev: str, emitted: Set[str]) -> Iterator[Com
def repo_commits(repo: PathIsh): def repo_commits(repo: PathIsh):
gr = git.Repo(str(repo)) gr = git.Repo(str(repo))
emitted: Set[str] = set() emitted: set[str] = set()
for r in gr.references: for r in gr.references:
yield from _repo_commits_aux(gr=gr, rev=r.path, emitted=emitted) yield from _repo_commits_aux(gr=gr, rev=r.path, emitted=emitted)
@ -109,73 +138,84 @@ def canonical_name(repo: Path) -> str:
# else: # else:
# rname = r.name # rname = r.name
# if 'backups/github' in repo: # if 'backups/github' in repo:
# pass # TODO # pass # TODO
# TODO could reuse in clustergit?.. def _fd_path() -> str:
def git_repos_in(roots: List[Path]) -> List[Path]: # todo move it to core
fd_path: str | None = shutil.which("fdfind") or shutil.which("fd-find") or shutil.which("fd")
if fd_path is None:
high("my.coding.commits requires 'fd' to be installed, See https://github.com/sharkdp/fd#installation")
assert fd_path is not None
return fd_path
def git_repos_in(roots: list[Path]) -> list[Path]:
from subprocess import check_output from subprocess import check_output
outputs = check_output([ outputs = check_output([
'fdfind', _fd_path(),
# '--follow', # right, not so sure about follow... make configurable? # '--follow', # right, not so sure about follow... make configurable?
'--hidden', '--hidden',
'--no-ignore', # otherwise doesn't go inside .git directory (from fd v9)
'--full-path', '--full-path',
'--type', 'f', '--type', 'f',
'/HEAD', # judging by is_git_dir, it should always be here.. '/HEAD', # judging by is_git_dir, it should always be here..
*roots, *roots,
]).decode('utf8').splitlines() ]).decode('utf8').splitlines()
candidates = set(Path(o).resolve().absolute().parent for o in outputs)
candidates = {Path(o).resolve().absolute().parent for o in outputs}
# exclude stuff within .git dirs (can happen for submodules?) # exclude stuff within .git dirs (can happen for submodules?)
candidates = {c for c in candidates if '.git' not in c.parts[:-1]} candidates = {c for c in candidates if '.git' not in c.parts[:-1]}
candidates = {c for c in candidates if is_git_dir(c)} candidates = {c for c in candidates if is_git_dir(c)}
repos = list(sorted(map(_git_root, candidates))) repos = sorted(map(_git_root, candidates))
return repos return repos
def repos(): def repos() -> list[Path]:
return git_repos_in(config.roots) return git_repos_in(list(map(Path, config().roots)))
def _hashf(_repos: List[Path]): # returns modification time for an index to use as hash function
# TODO maybe use smth from git library? ugh.. def _repo_depends_on(_repo: Path) -> int:
res = [] for pp in [
".git/FETCH_HEAD",
".git/HEAD",
"FETCH_HEAD", # bare
"HEAD", # bare
]:
ff = _repo / pp
if ff.exists():
return int(ff.stat().st_mtime)
raise RuntimeError(f"Could not find a FETCH_HEAD/HEAD file in {_repo}")
def _commits(_repos: list[Path]) -> Iterator[Commit]:
for r in _repos: for r in _repos:
# TODO just use anything except index? ugh. yield from _cached_commits(r)
for pp in {
'.git/FETCH_HEAD',
'.git/HEAD',
'FETCH_HEAD', # bare
'HEAD', # bare
}:
ff = r / pp
if ff.exists():
updated = ff.stat().st_mtime
break
else:
raise RuntimeError(r)
res.append((r, updated))
return res
# TODO per-repo cache?
# TODO set default cache path? def _cached_commits_path(p: Path) -> str:
# TODO got similar issue as in photos with a helper method.. figure it out p = cache_dir() / 'my.coding.commits:_cached_commits' / str(p.absolute()).strip("/")
@mcachew(hashf=_hashf, logger=log) p.mkdir(parents=True, exist_ok=True)
def _commits(_repos) -> Iterator[Commit]: return str(p)
for r in _repos:
log.info('processing %s', r)
yield from repo_commits(r) # per-repo commits, to use cachew
@mcachew(
depends_on=_repo_depends_on,
logger=log,
cache_path=_cached_commits_path,
)
def _cached_commits(repo: Path) -> Iterator[Commit]:
log.debug('processing %s', repo)
yield from repo_commits(repo)
def commits() -> Iterator[Commit]: def commits() -> Iterator[Commit]:
return _commits(repos()) return _commits(repos())
def print_all():
for c in commits():
print(c)
# TODO enforce read only? although it doesn't touch index # TODO enforce read only? although it doesn't touch index
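With the rewrite, the user config only needs the fields declared on commits_cfg; a minimal sketch (hypothetical values) of the commits section in a personal my.config:

class commits:
    roots = ['/home/user/code']  # scanned for git repos via fd
    emails = ['me@example.com']  # set at least one of emails/names,
    names = ['My Name']          # otherwise by_me() never matches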

View file

@@ -1,255 +1,12 @@
""" from typing import TYPE_CHECKING
Github events and their metadata: comments/issues/pull requests
"""
from .. import init from my.core import warnings
from typing import Dict, List, Union, Any, NamedTuple, Tuple, Optional, Iterator, TypeVar, Set warnings.high('my.coding.github is deprecated! Please use my.github.all instead!')
from datetime import datetime # todo why aren't DeprecationWarning shown by default??
import json
from pathlib import Path
import pytz if not TYPE_CHECKING:
from ..github.all import events, get_events # noqa: F401
from ..kython.klogging import LazyLogger # todo deprecate properly
from ..kython.kompress import CPath iter_events = events
from ..common import get_files, mcachew
from ..error import Res
from my.config import github as config
import my.config.repos.ghexport.dal as ghexport
logger = LazyLogger('my.github')
# TODO __package__???
class Event(NamedTuple):
dt: datetime
summary: str
eid: str
link: Optional[str]
body: Optional[str]=None
# TODO split further, title too
def _get_summary(e) -> Tuple[str, Optional[str], Optional[str]]:
tp = e['type']
pl = e['payload']
rname = e['repo']['name']
if tp == 'ForkEvent':
url = e['payload']['forkee']['html_url']
return f"forked {rname}", url, None
elif tp == 'PushEvent':
return f"pushed to {rname}", None, None
elif tp == 'WatchEvent':
return f"watching {rname}", None, None
elif tp == 'CreateEvent':
# TODO eh, only weird API link?
return f"created {rname}", None, f'created_{rname}'
elif tp == 'PullRequestEvent':
pr = pl['pull_request']
action = pl['action']
link = pr['html_url']
title = pr['title']
return f"{action} PR {title}", link, f'pull_request_{link}'
elif tp == "IssuesEvent":
action = pl['action']
iss = pl['issue']
link = iss['html_url']
title = iss['title']
return f"{action} issue {title}", link, None
elif tp == "IssueCommentEvent":
com = pl['comment']
link = com['html_url']
iss = pl['issue']
title = iss['title']
return f"commented on issue {title}", link, f'issue_comment_' + link
elif tp == "ReleaseEvent":
action = pl['action']
rel = pl['release']
tag = rel['tag_name']
link = rel['html_url']
return f"{action} {rname} [{tag}]", link, None
elif tp in (
"DeleteEvent",
"PublicEvent",
):
return tp, None, None # TODO ???
else:
return tp, None, None
def get_dal():
sources = get_files(config.export_dir, glob='*.json*')
sources = list(map(CPath, sources)) # TODO maybe move it to get_files? e.g. compressed=True arg?
return ghexport.DAL(sources)
def _parse_dt(s: str) -> datetime:
# TODO isoformat?
return pytz.utc.localize(datetime.strptime(s, '%Y-%m-%dT%H:%M:%SZ'))
# TODO extract to separate gdpr module?
# TODO typing.TypedDict could be handy here..
def _parse_common(d: Dict) -> Dict:
url = d['url']
body = d.get('body')
return {
'dt' : _parse_dt(d['created_at']),
'link': url,
'body': body,
}
def _parse_repository(d: Dict) -> Event:
pref = 'https://github.com/'
url = d['url']
assert url.startswith(pref); name = url[len(pref):]
return Event( # type: ignore[misc]
**_parse_common(d),
summary='created ' + name,
eid='created_' + name, # TODO ??
)
def _parse_issue_comment(d: Dict) -> Event:
url = d['url']
return Event( # type: ignore[misc]
**_parse_common(d),
summary=f'commented on issue {url}',
eid='issue_comment_' + url,
)
def _parse_issue(d: Dict) -> Event:
url = d['url']
title = d['title']
return Event( # type: ignore[misc]
**_parse_common(d),
summary=f'opened issue {title}',
eid='issue_comment_' + url,
)
def _parse_pull_request(d: Dict) -> Event:
url = d['url']
title = d['title']
return Event( # type: ignore[misc]
**_parse_common(d),
# TODO distinguish incoming/outgoing?
# TODO action? opened/closed??
summary=f'opened PR {title}',
eid='pull_request_' + url,
)
def _parse_release(d: Dict) -> Event:
tag = d['tag_name']
return Event( # type: ignore[misc]
**_parse_common(d),
summary=f'released {tag}',
eid='release_' + tag,
)
def _parse_commit_comment(d: Dict) -> Event:
url = d['url']
return Event( # type: ignore[misc]
**_parse_common(d),
summary=f'commented on {url}',
eid='commoit_comment_' + url,
)
def _parse_event(d: Dict) -> Event:
summary, link, eid = _get_summary(d)
if eid is None:
eid = d['id']
body = d.get('payload', {}).get('comment', {}).get('body')
return Event(
dt=_parse_dt(d['created_at']),
summary=summary,
link=link,
eid=eid,
body=body,
)
def iter_gdpr_events() -> Iterator[Res[Event]]:
"""
Parses events from GDPR export (https://github.com/settings/admin)
"""
files = list(sorted(config.gdpr_dir.glob('*.json')))
handler_map = {
'schema' : None,
'issue_events_': None, # eh, doesn't seem to have any useful bodies
'attachments_' : None, # not sure if useful
'users' : None, # just contains random users
'repositories_' : _parse_repository,
'issue_comments_': _parse_issue_comment,
'issues_' : _parse_issue,
'pull_requests_' : _parse_pull_request,
'releases_' : _parse_release,
'commit_comments': _parse_commit_comment,
}
for f in files:
handler: Any
for prefix, h in handler_map.items():
if not f.name.startswith(prefix):
continue
handler = h
break
else:
yield RuntimeError(f'Unhandled file: {f}')
continue
if handler is None:
# ignored
continue
j = json.loads(f.read_text())
for r in j:
try:
yield handler(r)
except Exception as e:
yield e
# TODO hmm. not good, need to be lazier?...
@mcachew(config.cache_dir, hashf=lambda dal: dal.sources)
def iter_backup_events(dal=get_dal()) -> Iterator[Event]:
for d in dal.events():
yield _parse_event(d)
def iter_events() -> Iterator[Res[Event]]:
from itertools import chain
emitted: Set[Tuple[datetime, str]] = set()
for e in chain(iter_gdpr_events(), iter_backup_events()):
if isinstance(e, Exception):
yield e
continue
key = (e.dt, e.eid) # use both just in case
# TODO wtf?? some minor (e.g. 1 sec) discrepancies (e.g. create repository events)
if key in emitted:
logger.debug('ignoring %s: %s', key, e)
continue
yield e
emitted.add(key)
def get_events():
return sorted(iter_events(), key=lambda e: e.dt)
# TODO mm. ok, not much point in deserializing as github.Event as it's basically a fancy dict wrapper?
# from github.Event import Event as GEvent # type: ignore
# # see https://github.com/PyGithub/PyGithub/blob/master/github/GithubObject.py::GithubObject.__init__
# e = GEvent(None, None, raw_event, True)
def test():
events = get_events()
assert len(events) > 100
for e in events:
print(e)

View file

@@ -1,104 +0,0 @@
#!/usr/bin/env python3
from .. import init
from my.config import topcoder as config
from datetime import datetime
from typing import NamedTuple
from pathlib import Path
import json
from typing import Dict, Iterator, Any
from ..common import cproperty, get_files
from ..error import Res, unwrap
# TODO get rid of fget?
from kython import fget
from kython.konsume import zoom, wrap, ignore
# TODO json type??
def _get_latest() -> Dict:
pp = max(get_files(config.export_path, glob='*.json'))
return json.loads(pp.read_text())
class Competition(NamedTuple):
contest_id: str
contest: str
percentile: float
dates: str
@cproperty
def uid(self) -> str:
return self.contest_id
def __hash__(self):
return hash(self.contest_id)
@cproperty
def when(self) -> datetime:
return datetime.strptime(self.dates, '%Y-%m-%dT%H:%M:%S.%fZ')
@cproperty
def summary(self) -> str:
return f'participated in {self.contest}: {self.percentile:.0f}'
@classmethod
def make(cls, json) -> Iterator[Res['Competition']]:
ignore(json, 'rating', 'placement')
cid = json['challengeId'].zoom().value
cname = json['challengeName'].zoom().value
percentile = json['percentile'].zoom().value
dates = json['date'].zoom().value
yield cls(
contest_id=cid,
contest=cname,
percentile=percentile,
dates=dates,
)
def iter_data() -> Iterator[Res[Competition]]:
with wrap(_get_latest()) as j:
ignore(j, 'id', 'version')
res = j['result'].zoom()
ignore(res, 'success', 'status', 'metadata')
cont = res['content'].zoom()
ignore(cont, 'handle', 'handleLower', 'userId', 'createdAt', 'updatedAt', 'createdBy', 'updatedBy')
cont['DEVELOP'].ignore() # TODO handle it??
ds = cont['DATA_SCIENCE'].zoom()
mar, srm = zoom(ds, 'MARATHON_MATCH', 'SRM')
mar = mar['history'].zoom()
srm = srm['history'].zoom()
# TODO right, I guess I could rely on pylint for unused variables??
for c in mar + srm:
yield from Competition.make(json=c)
c.consume()
def get_data():
return list(sorted(iter_data(), key=fget(Competition.when)))
def test():
assert len(get_data()) > 10
def main():
for d in iter_data():
try:
d = unwrap(d)
except Exception as e:
print(f'ERROR! {d}')
else:
print(d.summary)
if __name__ == '__main__':
main()

View file

@@ -1,170 +1,6 @@
+from .core.warnings import high
+high("DEPRECATED! Please use my.core.common instead.")
+from .core import __NOT_HPI_MODULE__
+from .core.common import *
from pathlib import Path
import functools
import types
from typing import Union, Callable, Dict, Iterable, TypeVar, Sequence, List, Optional, Any, cast
from . import init
# some helper functions
PathIsh = Union[Path, str]
# TODO port annotations to kython?..
def import_file(p: PathIsh, name: Optional[str]=None) -> types.ModuleType:
p = Path(p)
if name is None:
name = p.stem
import importlib.util
spec = importlib.util.spec_from_file_location(name, p)
foo = importlib.util.module_from_spec(spec)
loader = spec.loader; assert loader is not None
loader.exec_module(foo) # type: ignore[attr-defined]
return foo
def import_from(path: PathIsh, name: str) -> types.ModuleType:
path = str(path)
import sys
try:
sys.path.append(path)
import importlib
return importlib.import_module(name)
finally:
sys.path.remove(path)
T = TypeVar('T')
K = TypeVar('K')
V = TypeVar('V')
def the(l: Iterable[T]) -> T:
it = iter(l)
try:
first = next(it)
except StopIteration as ee:
raise RuntimeError('Empty iterator?')
assert all(e == first for e in it)
return first
def group_by_key(l: Iterable[T], key: Callable[[T], K]) -> Dict[K, List[T]]:
res: Dict[K, List[T]] = {}
for i in l:
kk = key(i)
lst = res.get(kk, [])
lst.append(i)
res[kk] = lst
return res
def _identity(v: T) -> V:
return cast(V, v)
def make_dict(l: Iterable[T], key: Callable[[T], K], value: Callable[[T], V]=_identity) -> Dict[K, V]:
res: Dict[K, V] = {}
for i in l:
k = key(i)
v = value(i)
pv = res.get(k, None) # type: ignore
if pv is not None:
raise RuntimeError(f"Duplicate key: {k}. Previous value: {pv}, new value: {v}")
res[k] = v
return res
Cl = TypeVar('Cl')
R = TypeVar('R')
def cproperty(f: Callable[[Cl], R]) -> R:
return property(functools.lru_cache(maxsize=1)(f)) # type: ignore
# https://stackoverflow.com/a/12377059/706389
def listify(fn=None, wrapper=list):
"""
Wraps a function's return value in wrapper (e.g. list)
Useful when an algorithm can be expressed more cleanly as a generator
"""
def listify_return(fn):
@functools.wraps(fn)
def listify_helper(*args, **kw):
return wrapper(fn(*args, **kw))
return listify_helper
if fn is None:
return listify_return
return listify_return(fn)
# TODO FIXME use in bluemaestro
# def dictify(fn=None, key=None, value=None):
# def md(it):
# return make_dict(it, key=key, value=value)
# return listify(fn=fn, wrapper=md)
from .kython.klogging import setup_logger, LazyLogger
Paths = Union[Sequence[PathIsh], PathIsh]
def get_files(pp: Paths, glob: str, sort: bool=True) -> List[Path]:
"""
Helper function to avoid boilerplate.
"""
# TODO FIXME mm, some wrapper to assert iterator isn't empty?
sources: List[Path] = []
if isinstance(pp, (str, Path)):
sources.append(Path(pp))
else:
sources.extend(map(Path, pp))
paths: List[Path] = []
for src in sources:
if src.is_dir():
gp: Iterable[Path] = src.glob(glob)
paths.extend(gp)
else:
assert src.is_file(), src
# TODO FIXME assert matches glob??
paths.append(src)
if sort:
paths = list(sorted(paths))
return paths
def mcachew(*args, **kwargs):
"""
Stands for 'Maybe cachew'.
Defensive wrapper around @cachew to make it an optional dependency.
"""
try:
import cachew
except ModuleNotFoundError:
import warnings
warnings.warn('cachew library not found. You might want to install it to speed things up. See https://github.com/karlicoss/cachew')
return lambda orig_func: orig_func
else:
import cachew.experimental
cachew.experimental.enable_exceptions() # TODO do it only once?
return cachew.cachew(*args, **kwargs)
@functools.lru_cache(1)
def _magic():
import magic # type: ignore
return magic.Magic(mime=True)
# TODO could reuse in pdf module?
import mimetypes # TODO do I need init()?
def fastermime(path: str) -> str:
# mimetypes is faster
(mime, _) = mimetypes.guess_type(path)
if mime is not None:
return mime
# magic is slower but returns more stuff
# TODO FIXME Result type; it's inherently racey
return _magic().from_file(path)
Json = Dict[str, Any]

286
my/config.py Normal file
View file

@@ -0,0 +1,286 @@
'''
NOTE: you shouldn't modify this file.
You probably want to edit your personal config (check via 'hpi config check' or create with 'hpi config create').
See https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#setting-up-modules for info on creating your own config
This file is used for:
- documentation (as an example of the config structure)
- mypy: this file provides some type annotations
- for loading the actual user config
'''
from __future__ import annotations
#### NOTE: you won't need this line VVVV in your personal config
from my.core import init # noqa: F401 # isort: skip
###
from datetime import tzinfo
from pathlib import Path
from my.core import PathIsh, Paths
class hypothesis:
# expects outputs from https://github.com/karlicoss/hypexport
# (it's just the standard Hypothes.is export format)
export_path: Paths = r'/path/to/hypothesis/data'
class instapaper:
export_path: Paths = ''
class smscalls:
export_path: Paths = ''
class pocket:
export_path: Paths = ''
class github:
export_path: Paths = ''
gdpr_dir: Paths = ''
class reddit:
class rexport:
export_path: Paths = ''
class pushshift:
export_path: Paths = ''
class gdpr:
export_path: Paths = ''
class endomondo:
export_path: Paths = ''
class exercise:
workout_log: PathIsh = '/some/path.org'
class bluemaestro:
export_path: Paths = ''
class stackexchange:
export_path: Paths = ''
class goodreads:
export_path: Paths = ''
class pinboard:
export_dir: Paths = ''
class google:
class maps:
class android:
export_path: Paths = ''
takeout_path: Paths = ''
from collections.abc import Sequence
from datetime import date, datetime, timedelta
from typing import Union
DateIsh = Union[datetime, date, str]
LatLon = tuple[float, float]
class location:
# todo ugh, need to think about it... mypy wants the type here to be general, otherwise it can't deduce
# and we can't import the types from the module itself, otherwise would be circular. common module?
home: LatLon | Sequence[tuple[DateIsh, LatLon]] = (1.0, -1.0)
home_accuracy = 30_000.0
class via_ip:
accuracy: float
for_duration: timedelta
class gpslogger:
export_path: Paths = ''
accuracy: float
class google_takeout_semantic:
# a value between 0 and 100, 100 being the most confident
# set to 0 to include all locations
# https://locationhistoryformat.com/reference/semantic/#/$defs/placeVisit/properties/locationConfidence
require_confidence: float = 40
# default accuracy for semantic locations
accuracy: float = 100
from typing import Literal
class time:
class tz:
policy: Literal['keep', 'convert', 'throw']
class via_location:
fast: bool
sort_locations: bool
require_accuracy: float
class orgmode:
paths: Paths
class arbtt:
logfiles: Paths
class commits:
emails: Sequence[str] | None
names: Sequence[str] | None
roots: Sequence[PathIsh]
class pdfs:
paths: Paths
class zulip:
class organization:
export_path: Paths
class bumble:
class android:
export_path: Paths
class tinder:
class android:
export_path: Paths
class instagram:
class android:
export_path: Paths
username: str | None
full_name: str | None
class gdpr:
export_path: Paths
class hackernews:
class dogsheep:
export_path: Paths
class materialistic:
export_path: Paths
class fbmessenger:
class fbmessengerexport:
export_db: PathIsh
facebook_id: str | None
class android:
export_path: Paths
class twitter_archive:
export_path: Paths
class twitter:
class talon:
export_path: Paths
class android:
export_path: Paths
class twint:
export_path: Paths
class browser:
class export:
export_path: Paths = ''
class active_browser:
export_path: Paths = ''
class telegram:
class telegram_backup:
export_path: PathIsh = ''
class demo:
data_path: Paths
username: str
timezone: tzinfo
class simple:
count: int
class vk_messages_backup:
storage_path: Path
user_id: int
class kobo:
export_path: Paths
class feedly:
export_path: Paths
class feedbin:
export_path: Paths
class taplog:
export_path: Paths
class lastfm:
export_path: Paths
class rescuetime:
export_path: Paths
class runnerup:
export_path: Paths
class emfit:
export_path: Path
timezone: tzinfo
excluded_sids: list[str]
class foursquare:
export_path: Paths
class rtm:
export_path: Paths
class imdb:
export_path: Paths
class roamresearch:
export_path: Paths
username: str
class whatsapp:
class android:
export_path: Paths
my_user_id: str | None
class harmonic:
export_path: Paths
class monzo:
class monzoexport:
export_path: Paths

61
my/core/__init__.py Normal file
View file

@@ -0,0 +1,61 @@
# this file only keeps the most common & critical types/utility functions
from typing import TYPE_CHECKING
from .cfg import make_config
from .common import PathIsh, Paths, get_files
from .compat import assert_never
from .error import Res, notnone, unwrap
from .logging import (
make_logger,
)
from .stats import Stats, stat
from .types import (
Json,
datetime_aware,
datetime_naive,
)
from .util import __NOT_HPI_MODULE__
from .utils.itertools import warn_if_empty
LazyLogger = make_logger # TODO deprecate this in favor of make_logger
if not TYPE_CHECKING:
# we used to keep these here for brevity, but feels like it only adds confusion,
# e.g. suggest that we perhaps somehow modify builtin behaviour or whatever
# so best to prefer explicit behaviour
from dataclasses import dataclass
from pathlib import Path
__all__ = [
'__NOT_HPI_MODULE__',
'Json',
'LazyLogger', # legacy import
'Path',
'PathIsh',
'Paths',
'Res',
'Stats',
'assert_never', # TODO maybe deprecate from use in my.core? will be in stdlib soon
'dataclass',
'datetime_aware',
'datetime_naive',
'get_files',
'make_config',
'make_logger',
'notnone',
'stat',
'unwrap',
'warn_if_empty',
]
## experimental for now
# you could put _init_hook.py next to your private my/config
# that way you can configure logging/warnings/env variables on every HPI import
try:
import my._init_hook # type: ignore[import-not-found] # noqa: F401
except:
pass
##
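For the _init_hook escape hatch above, a minimal sketch of what my/_init_hook.py next to a private config might contain (hypothetical contents):

import os

# bump HPI logging for every import of the package
os.environ.setdefault('LOGGING_LEVEL_HPI', 'debug')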

918
my/core/__main__.py Normal file
View file

@@ -0,0 +1,918 @@
from __future__ import annotations
import functools
import importlib
import inspect
import os
import shlex
import shutil
import sys
import tempfile
import traceback
from collections.abc import Iterable, Sequence
from contextlib import ExitStack
from itertools import chain
from pathlib import Path
from subprocess import PIPE, CompletedProcess, Popen, check_call, run
from typing import Any, Callable
import click
@functools.lru_cache
def mypy_cmd() -> Sequence[str] | None:
try:
# preferably, use mypy from current python env
import mypy # noqa: F401 fine not to use it
except ImportError:
pass
else:
return [sys.executable, '-m', 'mypy']
# ok, not ideal but try from PATH
if shutil.which('mypy'):
return ['mypy']
warning("mypy not found, so can't check config with it. See https://github.com/python/mypy#readme if you want to install it and retry")
return None
def run_mypy(cfg_path: Path) -> CompletedProcess | None:
# todo dunno maybe use the same mypy config in repository?
# I'd need to install mypy.ini then??
env = {**os.environ}
mpath = env.get('MYPYPATH')
mpath = str(cfg_path) + ('' if mpath is None else f':{mpath}')
env['MYPYPATH'] = mpath
cmd = mypy_cmd()
if cmd is None:
return None
mres = run([ # noqa: UP022,PLW1510
*cmd,
'--namespace-packages',
'--color-output', # not sure if works??
'--pretty',
'--show-error-codes',
'--show-error-context',
'--check-untyped-defs',
'-p', 'my.config',
], stderr=PIPE, stdout=PIPE, env=env)
return mres
# use click.echo over print since it handles possible Unicode errors,
# strips colors if the output is a file
# https://click.palletsprojects.com/en/7.x/quickstart/#echoing
def eprint(x: str) -> None:
# err=True prints to stderr
click.echo(x, err=True)
def indent(x: str) -> str:
# todo use textwrap.indent?
return ''.join(' ' + l for l in x.splitlines(keepends=True))
OK = '✅'
OFF = '🔲'
def info(x: str) -> None:
eprint(OK + ' ' + x)
def error(x: str) -> None:
eprint('❌ ' + x)
def warning(x: str) -> None:
eprint('❗ ' + x) # todo yellow?
def tb(e: Exception) -> None:
tb = ''.join(traceback.format_exception(Exception, e, e.__traceback__))
sys.stderr.write(indent(tb))
def config_create() -> None:
from .preinit import get_mycfg_dir
mycfg_dir = get_mycfg_dir()
created = False
if not mycfg_dir.exists():
# todo not sure about the layout... should I use my/config.py instead?
my_config = mycfg_dir / 'my' / 'config' / '__init__.py'
my_config.parent.mkdir(parents=True)
my_config.write_text(
'''
### HPI personal config
## see
# https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#setting-up-modules
# https://github.com/karlicoss/HPI/blob/master/doc/MODULES.org
## for some help on writing your own config
# to quickly check your config, run:
# hpi config check
# to quickly check a specific module setup, run hpi doctor <module>, e.g.:
# hpi doctor my.reddit.rexport
### useful default imports
from my.core import Paths, PathIsh, get_files
###
# most of your configs will look like this:
class example:
export_path: Paths = '/home/user/data/example_data_dir/'
### you can insert your own configuration below
### but feel free to delete the stuff above if you don't need it
'''.lstrip()
)
info(f'created empty config: {my_config}')
created = True
else:
error(f"config directory '{mycfg_dir}' already exists, skipping creation")
check_passed = config_ok()
if not created or not check_passed:
sys.exit(1)
# todo return the config as a result?
def config_ok() -> bool:
errors: list[Exception] = []
# at this point 'my' should already be imported, so doesn't hurt to extract paths from it
import my
try:
paths: list[str] = list(my.__path__)
except Exception as e:
errors.append(e)
error('failed to determine module import path')
tb(e)
else:
info(f'import order: {paths}')
# first try doing as much as possible without actually importing my.config
from .preinit import get_mycfg_dir
cfg_path = get_mycfg_dir()
# alternative is importing my.config and then getting cfg_path from its __file__/__path__
# not sure which is better tbh
## check we're not using stub config
import my.core
try:
core_pkg_path = str(Path(my.core.__path__[0]).parent)
if str(cfg_path).startswith(core_pkg_path):
error(
f'''
Seems that the stub config is used ({cfg_path}). This is likely not going to work.
See https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#setting-up-modules for more information
'''.strip()
)
errors.append(RuntimeError('bad config path'))
except Exception as e:
errors.append(e)
tb(e)
else:
info(f"config path : {cfg_path}")
##
## check syntax
with tempfile.TemporaryDirectory() as td:
# use a temporary directory, useful because
# - compileall ignores -B, so always craps with .pyc files (annoying on RO filesystems)
# - compileall isn't following symlinks, just silently ignores them
tdir = Path(td) / 'cfg'
# NOTE: compileall still returns code 0 if the path doesn't exist..
# but in our case hopefully it's not an issue
cmd = [sys.executable, '-m', 'compileall', '-q', str(tdir)]
try:
# this will resolve symlinks when copying
# should be under try/catch since might fail if some symlinks are missing
shutil.copytree(cfg_path, tdir, dirs_exist_ok=True)
check_call(cmd)
info('syntax check: ' + ' '.join(cmd))
except Exception as e:
errors.append(e)
tb(e)
##
## check types
mypy_res = run_mypy(cfg_path)
if mypy_res is not None: # has mypy
rc = mypy_res.returncode
if rc == 0:
info('mypy check : success')
else:
error('mypy check: failed')
errors.append(RuntimeError('mypy failed'))
sys.stderr.write(indent(mypy_res.stderr.decode('utf8')))
sys.stderr.write(indent(mypy_res.stdout.decode('utf8')))
##
## finally, try actually importing the config (it should use same cfg_path)
try:
import my.config
except Exception as e:
errors.append(e)
error("failed to import the config")
tb(e)
##
if len(errors) > 0:
error(f'config check: {len(errors)} errors')
return False
# note: shouldn't exit here, might run something else
info('config check: success!')
return True
from .util import HPIModule, modules
def _modules(*, all: bool = False) -> Iterable[HPIModule]:
skipped = []
for m in modules():
if not all and m.skip_reason is not None:
skipped.append(m.name)
else:
yield m
if len(skipped) > 0:
warning(f'Skipped {len(skipped)} modules: {skipped}. Pass --all if you want to see them.')
def modules_check(*, verbose: bool, list_all: bool, quick: bool, for_modules: list[str]) -> None:
if len(for_modules) > 0:
# if you're checking specific modules, show errors
# hopefully makes sense?
verbose = True
vw = '' if verbose else '; pass --verbose to print more information'
tabulate_warnings()
import contextlib
from .error import warn_my_config_import_error
from .stats import get_stats, quick_stats
from .util import HPIModule
mods: Iterable[HPIModule]
if len(for_modules) == 0:
mods = _modules(all=list_all)
else:
mods = [HPIModule(name=m, skip_reason=None) for m in for_modules]
# todo add a --all argument to disregard is_active check?
for mr in mods:
skip = mr.skip_reason
m = mr.name
if skip is not None:
eprint(f'{OFF} {click.style("SKIP", fg="yellow")}: {m:<50} {skip}')
continue
try:
mod = importlib.import_module(m) # noqa: F841
except Exception as e:
# todo more specific command?
error(f'{click.style("FAIL", fg="red")}: {m:<50} loading failed{vw}')
# check that this is an import error in particular, not because
# of a ModuleNotFoundError because some dependency wasn't installed
if isinstance(e, (ImportError, AttributeError)):
warn_my_config_import_error(e)
if verbose:
tb(e)
continue
info(f'{click.style("OK", fg="green")} : {m:<50}')
# TODO add hpi 'stats'? instead of doctor? not sure
stats = get_stats(m, guess=True)
if stats is None:
eprint(" - no 'stats' function, can't check the data")
# todo point to a readme on the module structure or something?
continue
quick_context = quick_stats() if quick else contextlib.nullcontext()
try:
kwargs = {}
# todo hmm why wouldn't they be callable??
if callable(stats) and 'quick' in inspect.signature(stats).parameters:
kwargs['quick'] = quick
with quick_context:
res = stats(**kwargs)
assert res is not None, 'stats() returned None'
except Exception as ee:
warning(f' - {click.style("stats:", fg="red")} computing failed{vw}')
if verbose:
tb(ee)
else:
info(f' - stats: {res}')
def list_modules(*, list_all: bool) -> None:
# todo add a --sort argument?
tabulate_warnings()
for mr in _modules(all=list_all):
m = mr.name
sr = mr.skip_reason
if sr is None:
pre = OK
suf = ''
else:
pre = OFF
suf = f' {click.style(f"[disabled: {sr}]", fg="yellow")}'
click.echo(f'{pre} {m:50}{suf}')
def tabulate_warnings() -> None:
'''
Helper to avoid visual noise in hpi modules/doctor
'''
import warnings
orig = warnings.formatwarning
def override(*args, **kwargs) -> str:
res = orig(*args, **kwargs)
return ''.join(' ' + x for x in res.splitlines(keepends=True))
warnings.formatwarning = override
# TODO loggers as well?
def _requires(modules: Sequence[str]) -> Sequence[str]:
from .discovery_pure import module_by_name
mods = [module_by_name(module) for module in modules]
res = []
for mod in mods:
if mod.legacy is not None:
warning(mod.legacy)
reqs = mod.requires
if reqs is None:
warning(f"Module {mod.name} has no REQUIRES specification")
continue
for r in reqs:
if r not in res:
res.append(r)
return res
def module_requires(*, module: Sequence[str]) -> None:
if isinstance(module, str):
# legacy behavior, used to take a since argument
module = [module]
rs = [f"'{x}'" for x in _requires(modules=module)]
eprint(f'dependencies of {module}')
for x in rs:
click.echo(x)
def module_install(*, user: bool, module: Sequence[str], parallel: bool = False, break_system_packages: bool = False) -> None:
if isinstance(module, str):
# legacy behavior, used to take a since argument
module = [module]
requirements = _requires(module)
if len(requirements) == 0:
warning('requirements list is empty, no need to install anything')
return
use_uv = 'HPI_MODULE_INSTALL_USE_UV' in os.environ
pre_cmd = [
sys.executable, '-m', *(['uv'] if use_uv else []), 'pip',
'install',
*(['--user'] if user else []), # todo maybe instead, forward all the remaining args to pip?
*(['--break-system-packages'] if break_system_packages else []), # https://peps.python.org/pep-0668/
]
cmds = []
# disable parallel on windows, sometimes throws a
# '[WinError 32] The process cannot access the file because it is being used by another process'
# same on mac it seems? possible race conditions which are hard to debug?
# WARNING: Error parsing requirements for sqlalchemy: [Errno 2] No such file or directory: '/Users/runner/work/HPI/HPI/.tox/mypy-misc/lib/python3.7/site-packages/SQLAlchemy-2.0.4.dist-info/METADATA'
if parallel and sys.platform not in ['win32', 'cygwin', 'darwin']:
# todo not really sure if it's safe to install in parallel like this
# but definitely doesn't hurt to experiment for e.g. mypy pipelines
# pip has '--use-feature=fast-deps', but it doesn't really work
# I think it only helps for pypi artifacts (not git!),
# and only if they weren't cached
for r in requirements:
cmds.append([*pre_cmd, r])
else:
if parallel:
warning('parallel install is not supported on this platform, installing sequentially...')
# install everything in one cmd
cmds.append(pre_cmd + list(requirements))
with ExitStack() as exit_stack:
popens = []
for cmd in cmds:
eprint('Running: ' + ' '.join(map(shlex.quote, cmd)))
popen = exit_stack.enter_context(Popen(cmd))
popens.append(popen)
for popen in popens:
ret = popen.wait()
assert ret == 0, popen
def _ui_getchar_pick(choices: Sequence[str], prompt: str = 'Select from: ') -> int:
'''
Basic menu allowing the user to select one of the choices
returns the index the user chose
'''
assert len(choices) > 0, 'Didnt receive any choices to prompt!'
eprint(prompt + '\n')
# prompts like 1,2,3,4,5,6,7,8,9,a,b,c,d,e,f...
chr_offset = ord('a') - 10
# dict from key user can press -> resulting index
result_map = {}
for i, opt in enumerate(choices, 1):
char: str = str(i) if i < 10 else chr(i + chr_offset)
result_map[char] = i - 1
eprint(f'\t{char}. {opt}')
eprint('')
while True:
ch = click.getchar()
if ch not in result_map:
eprint(f'{ch} not in {list(result_map.keys())}')
continue
return result_map[ch]
def _locate_functions_or_prompt(qualified_names: list[str], *, prompt: bool = True) -> Iterable[Callable[..., Any]]:
from .query import QueryException, locate_qualified_function
from .stats import is_data_provider
# if not connected to a terminal, can't prompt
if not sys.stdout.isatty():
prompt = False
for qualname in qualified_names:
try:
# common-case
yield locate_qualified_function(qualname)
except QueryException as qr_err:
# maybe the user specified a module name instead of a function name?
# try importing the name the user specified as a module and prompt the
# user to select a 'data provider' like function
try:
mod = importlib.import_module(qualname)
except Exception as ie:
eprint(f"During fallback, importing '{qualname}' as module failed")
raise qr_err from ie
# find data providers in this module
data_providers = [f for _, f in inspect.getmembers(mod, inspect.isfunction) if is_data_provider(f)]
if len(data_providers) == 0:
eprint(f"During fallback, could not find any data providers in '{qualname}'")
raise qr_err
else:
# was only one data provider-like function, use that
if len(data_providers) == 1:
yield data_providers[0]
else:
choices = [f.__name__ for f in data_providers]
if prompt is False:
# there's more than one possible data provider in this module,
# STDOUT is not a TTY, can't prompt
eprint("During fallback, more than one possible data provider, can't prompt since STDOUT is not a TTY")
eprint("Specify one of:")
for funcname in choices:
eprint(f"\t{qualname}.{funcname}")
raise qr_err
# prompt the user to pick the function to use
chosen_index = _ui_getchar_pick(choices, f"Which function should be used from '{qualname}'?")
# respond to the user, so they know something has been picked
eprint(f"Selected '{choices[chosen_index]}'")
yield data_providers[chosen_index]
def _warn_exceptions(exc: Exception) -> None:
from my.core import make_logger
logger = make_logger('CLI', level='warning')
logger.exception(f'hpi query: {exc}')
# handle the 'hpi query' call
# can raise a QueryException, caught in the click command
def query_hpi_functions(
*,
output: str = 'json',
stream: bool = False,
qualified_names: list[str],
order_key: str | None,
order_by_value_type: type | None,
after: Any,
before: Any,
within: Any,
reverse: bool = False,
limit: int | None,
drop_unsorted: bool,
wrap_unsorted: bool,
warn_exceptions: bool,
raise_exceptions: bool,
drop_exceptions: bool,
) -> None:
from .query_range import RangeTuple, select_range
# chain list of functions from user, in the order they wrote them on the CLI
input_src = chain(*(f() for f in _locate_functions_or_prompt(qualified_names)))
# NOTE: if passing just one function to this which returns a single namedtuple/dataclass,
# using both --order-key and --order-type will often be faster as it does not need to
# duplicate the iterator in memory, or try to find the --order-type type on each object before sorting
res = select_range(
input_src,
order_key=order_key,
order_by_value_type=order_by_value_type,
unparsed_range=RangeTuple(after=after, before=before, within=within),
reverse=reverse,
limit=limit,
drop_unsorted=drop_unsorted,
wrap_unsorted=wrap_unsorted,
warn_exceptions=warn_exceptions,
warn_func=_warn_exceptions,
raise_exceptions=raise_exceptions,
drop_exceptions=drop_exceptions,
)
if output == 'json':
from .serialize import dumps
if stream:
for item in res:
# use sys.stdout directly
# the overhead from click.echo isn't a *lot*, but when called in a loop
# with potentially millions of items it makes a noticeable difference
sys.stdout.write(dumps(item))
sys.stdout.write('\n')
sys.stdout.flush()
else:
click.echo(dumps(list(res)))
elif output == 'pprint':
from pprint import pprint
if stream:
for item in res:
pprint(item)
else:
pprint(list(res))
elif output == 'gpx':
from my.location.common import locations_to_gpx
# if user didn't specify to ignore exceptions, warn if locations_to_gpx
# cannot process the output of the command. This can be silenced by
# passing --drop-exceptions
if not raise_exceptions and not drop_exceptions:
warn_exceptions = True
# can ignore the mypy warning here, locations_to_gpx yields any errors
# if you didn't pass it something that matches the LocationProtocol
for exc in locations_to_gpx(res, sys.stdout): # type: ignore[arg-type]
if warn_exceptions:
_warn_exceptions(exc)
elif raise_exceptions:
raise exc
elif drop_exceptions:
pass
sys.stdout.flush()
else:
res = list(res) # type: ignore[assignment]
# output == 'repl'
eprint(f"\nInteract with the results by using the {click.style('res', fg='green')} variable\n")
try:
import IPython # type: ignore[import,unused-ignore]
except ModuleNotFoundError:
eprint("'repl' typically uses ipython, install it with 'python3 -m pip install ipython'. falling back to stdlib...")
import code
code.interact(local=locals())
else:
IPython.embed()
@click.group()
@click.option("--debug", is_flag=True, default=False, help="Show debug logs")
def main(*, debug: bool) -> None:
'''
Human Programming Interface
Tool for HPI
Work in progress, will be used for config management, troubleshooting & introspection
'''
# should overwrite anything else in LOGGING_LEVEL_HPI
if debug:
os.environ['LOGGING_LEVEL_HPI'] = 'debug'
# for potential future reference, if shared state needs to be added to groups
# https://click.palletsprojects.com/en/7.x/commands/#group-invocation-without-command
# https://click.palletsprojects.com/en/7.x/commands/#multi-command-chaining
# acts as a contextmanager of sorts - any subcommand will then run
# in something like /tmp/hpi_temp_dir
# to avoid importing relative modules by accident during development
# maybe can be removed later if there's more test coverage/confidence that nothing
# would happen?
# use a particular directory instead of a random one, since
# click being decorator based means its more complicated
# to run things at the end (would need to use a callback or pass context)
# https://click.palletsprojects.com/en/7.x/commands/#nested-handling-and-contexts
tdir = Path(tempfile.gettempdir()) / 'hpi_temp_dir'
tdir.mkdir(exist_ok=True)
os.chdir(tdir)
@functools.lru_cache(maxsize=1)
def _all_mod_names() -> list[str]:
"""Should include all modules, in case user is trying to diagnose issues"""
# sort this, so that the order doesn't change while tabbing through
return sorted([m.name for m in modules()])
def _module_autocomplete(ctx: click.Context, args: Sequence[str], incomplete: str) -> list[str]:
return [m for m in _all_mod_names() if m.startswith(incomplete)]
@main.command(name='doctor', short_help='run various checks')
@click.option('--verbose/--quiet', default=False, help='Print more diagnostic information')
@click.option('--all', 'list_all', is_flag=True, help='List all modules, including disabled')
@click.option('-q', '--quick', is_flag=True, help='Only run partial checks (first 100 items)')
@click.option('-S', '--skip-config-check', 'skip_conf', is_flag=True, help='Skip configuration check')
@click.argument('MODULE', nargs=-1, required=False, shell_complete=_module_autocomplete)
def doctor_cmd(*, verbose: bool, list_all: bool, quick: bool, skip_conf: bool, module: Sequence[str]) -> None:
'''
Run various checks
MODULE is one or more specific module names to check (e.g. my.reddit.rexport)
Otherwise, checks all modules
'''
if not skip_conf:
config_ok()
# TODO check that it finds private modules too?
modules_check(verbose=verbose, list_all=list_all, quick=quick, for_modules=list(module))
@main.group(name='config', short_help='work with configuration')
def config_grp() -> None:
'''Act on your HPI configuration'''
pass
@config_grp.command(name='check', short_help='check config')
def config_check_cmd() -> None:
'''Check your HPI configuration file'''
ok = config_ok()
sys.exit(0 if ok else 1)  # sys.exit(False) would also exit with code 0
@config_grp.command(name='create', short_help='create user config')
def config_create_cmd() -> None:
'''Create user configuration file for HPI'''
config_create()
@main.command(name='modules', short_help='list available modules')
@click.option('--all', 'list_all', is_flag=True, help='List all modules, including disabled')
def module_cmd(*, list_all: bool) -> None:
'''List available modules'''
list_modules(list_all=list_all)
@main.group(name='module', short_help='module management')
def module_grp() -> None:
'''Module management'''
pass
@module_grp.command(name='requires', short_help='print module reqs')
@click.argument('MODULES', shell_complete=_module_autocomplete, nargs=-1, required=True)
def module_requires_cmd(*, modules: Sequence[str]) -> None:
'''
Print MODULES requirements
MODULES is one or more specific module names (e.g. my.reddit.rexport)
'''
module_requires(module=modules)
@module_grp.command(name='install', short_help='install module deps')
@click.option('--user', is_flag=True, help='same as pip --user')
@click.option('--parallel', is_flag=True, help='EXPERIMENTAL. Install dependencies in parallel.')
@click.option('-B',
'--break-system-packages',
is_flag=True,
help='Bypass PEP 668 and install dependencies into the system-wide python package directory.')
@click.argument('MODULES', shell_complete=_module_autocomplete, nargs=-1, required=True)
def module_install_cmd(*, user: bool, parallel: bool, break_system_packages: bool, modules: Sequence[str]) -> None:
'''
Install dependencies for modules using pip
MODULES is one or more specific module names (e.g. my.reddit.rexport)
'''
# todo could add functions to check specific module etc..
module_install(user=user, module=modules, parallel=parallel, break_system_packages=break_system_packages)
@main.command(name='query', short_help='query the results of a HPI function')
@click.option('-o',
'--output',
default='json',
type=click.Choice(['json', 'pprint', 'repl', 'gpx']),
help='what to do with the result [default: json]')
@click.option('-s',
'--stream',
default=False,
is_flag=True,
help='stream objects from the data source instead of printing a list at the end')
@click.option('-k',
'--order-key',
default=None,
type=str,
help='order by an object attribute or dict key on the individual objects returned by the HPI function')
@click.option('-t',
'--order-type',
default=None,
type=click.Choice(['datetime', 'date', 'int', 'float']),
help='order by searching for some type on the iterable')
@click.option('-a',
'--after',
default=None,
type=str,
help='while ordering, filter items for the key or type larger than or equal to this')
@click.option('-b',
'--before',
default=None,
type=str,
help='while ordering, filter items for the key or type smaller than this')
@click.option('-w',
'--within',
default=None,
type=str,
help="a range 'after' or 'before' to filter items by. see above for further explanation")
@click.option('-r',
'--recent',
default=None,
type=str,
help="a shorthand for '--order-type datetime --reverse --before now --within'. e.g. --recent 5d")
@click.option('--reverse/--no-reverse',
default=False,
help='reverse the results returned from the functions')
@click.option('-l',
'--limit',
default=None,
type=int,
help='limit the number of items returned from the (functions)')
@click.option('--drop-unsorted',
default=False,
is_flag=True,
help="if the order of an item can't be determined while ordering, drop those items from the results")
@click.option('--wrap-unsorted',
default=False,
is_flag=True,
help="if the order of an item can't be determined while ordering, wrap them into an 'Unsortable' object")
@click.option('--warn-exceptions',
default=False,
is_flag=True,
help="if any errors are returned, print them as errors on STDERR")
@click.option('--raise-exceptions',
default=False,
is_flag=True,
help="if any errors are returned (as objects, not raised) from the functions, raise them")
@click.option('--drop-exceptions',
default=False,
is_flag=True,
help='ignore any errors returned as objects from the functions')
@click.argument('FUNCTION_NAME', nargs=-1, required=True, shell_complete=_module_autocomplete)
def query_cmd(
*,
function_name: Sequence[str],
output: str,
stream: bool,
order_key: str | None,
order_type: str | None,
after: str | None,
before: str | None,
within: str | None,
recent: str | None,
reverse: bool,
limit: int | None,
drop_unsorted: bool,
wrap_unsorted: bool,
warn_exceptions: bool,
raise_exceptions: bool,
drop_exceptions: bool,
) -> None:
'''
This allows you to query the results from one or more functions in HPI
By default this runs with '-o json', converting the results
to JSON and printing them to STDOUT
You can specify '-o pprint' to just print the objects using their
repr, or '-o repl' to drop into an ipython shell with access to the results
While filtering using --order-key datetime, the --after, --before and --within
flags parse the input to their datetime and timedelta equivalents. datetimes can
be epoch time, the string 'now', or a date in ISO format. timedeltas
(durations) are parsed from a format similar to the GNU 'sleep' command, e.g.
1w2d8h5m20s -> 1 week, 2 days, 8 hours, 5 minutes, 20 seconds
As an example, to query reddit comments I've made in the last month
\b
hpi query --order-type datetime --before now --within 4w my.reddit.all.comments
or...
hpi query --recent 4w my.reddit.all.comments
\b
Can also query within a range. To filter comments between 2016 and 2018:
hpi query --order-type datetime --after '2016-01-01' --before '2019-01-01' my.reddit.all.comments
'''
from datetime import date, datetime
chosen_order_type: type | None
if order_type == "datetime":
chosen_order_type = datetime
elif order_type == "date":
chosen_order_type = date
elif order_type == "int":
chosen_order_type = int
elif order_type == "float":
chosen_order_type = float
else:
chosen_order_type = None
if recent is not None:
before = "now"
chosen_order_type = chosen_order_type or datetime # don't override if the user specified date
within = recent
reverse = not reverse
from .query import QueryException
try:
query_hpi_functions(
output=output,
stream=stream,
qualified_names=list(function_name),
order_key=order_key,
order_by_value_type=chosen_order_type,
after=after,
before=before,
within=within,
reverse=reverse,
limit=limit,
drop_unsorted=drop_unsorted,
wrap_unsorted=wrap_unsorted,
warn_exceptions=warn_exceptions,
raise_exceptions=raise_exceptions,
drop_exceptions=drop_exceptions,
)
except QueryException as qe:
eprint(str(qe))
sys.exit(1)
# todo: add more tests?
# it's standard click practice to have the function click calls be a separate
# function from the decorated function, as it allows the application-specific code to be
# more testable. also allows hpi commands to be imported and called manually from
# other python code
def test_requires() -> None:
from click.testing import CliRunner
result = CliRunner().invoke(main, ['module', 'requires', 'my.github.ghexport', 'my.browser.export'])
assert result.exit_code == 0
assert "github.com/karlicoss/ghexport" in result.output
assert "browserexport" in result.output
if __name__ == '__main__':
# prog_name is so that if this is invoked with python -m my.core
# this still shows hpi in the help text
main(prog_name='hpi')

35
my/core/_cpu_pool.py Normal file

@@ -0,0 +1,35 @@
"""
EXPERIMENTAL! use with caution
Manages 'global' ProcessPoolExecutor which is 'managed' by HPI itself, and
can be passed down to DALs to speed up data processing.
The reason to have it managed by HPI is that we don't want DALs to instantiate pools
themselves -- they can't cooperate and it would be hard/infeasible to control
how many cores we want to dedicate to the DAL.
Enabled by an env variable specifying how many cores to dedicate,
e.g. "HPI_CPU_POOL=4 hpi query ..."
"""
from __future__ import annotations
import os
from concurrent.futures import ProcessPoolExecutor
from typing import cast
_NOT_SET = cast(ProcessPoolExecutor, object())
_INSTANCE: ProcessPoolExecutor | None = _NOT_SET
def get_cpu_pool() -> ProcessPoolExecutor | None:
global _INSTANCE
if _INSTANCE is _NOT_SET:
use_cpu_pool = os.environ.get('HPI_CPU_POOL')
if use_cpu_pool is None or int(use_cpu_pool) == 0:
_INSTANCE = None
else:
# NOTE: this won't be cleaned up properly, but I guess it's fine?
# since it's basically a singleton for the whole process,
# and will be destroyed when python exits
_INSTANCE = ProcessPoolExecutor(max_workers=int(use_cpu_pool))
return _INSTANCE
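# Usage sketch (illustrative, not part of this diff; parse_one is a hypothetical
# worker function): a DAL could consume the shared pool like this:
#
#     from my.core._cpu_pool import get_cpu_pool
#
#     def process_all(paths):
#         pool = get_cpu_pool()
#         if pool is None:  # HPI_CPU_POOL unset or 0 -- process sequentially
#             return [parse_one(p) for p in paths]
#         return list(pool.map(parse_one, paths))  # e.g. HPI_CPU_POOL=4 -> 4 workers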


@@ -0,0 +1,12 @@
from ..common import PathIsh
from ..sqlite import sqlite_connect_immutable
def connect_readonly(db: PathIsh):
import dataset # type: ignore
# see https://github.com/pudo/dataset/issues/136#issuecomment-128693122
# todo not sure if mode=ro has any benefit, but it doesn't work on read-only filesystems
# maybe it should autodetect readonly filesystems and apply this? not sure
creator = lambda: sqlite_connect_immutable(db)
return dataset.connect('sqlite:///', engine_kwargs={'creator': creator})
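# Usage sketch (path and table name are placeholders): dataset exposes tables as
# iterables of dicts, so the read-only connection can be queried directly:
#
#     db = connect_readonly('/path/to/messages.sqlite')
#     for row in db['messages'].all():
#         print(row)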


@@ -0,0 +1,261 @@
"""
Various helpers for compression
"""
# fmt: off
from __future__ import annotations
import io
import pathlib
from collections.abc import Iterator, Sequence
from datetime import datetime
from functools import total_ordering
from pathlib import Path
from typing import IO, Union
PathIsh = Union[Path, str]
class Ext:
xz = '.xz'
zip = '.zip'
lz4 = '.lz4'
zstd = '.zstd'
zst = '.zst'
targz = '.tar.gz'
def is_compressed(p: Path) -> bool:
# todo kinda lame way for now.. use mime ideally?
# should cooperate with kompress.kopen?
return any(p.name.endswith(ext) for ext in [Ext.xz, Ext.zip, Ext.lz4, Ext.zstd, Ext.zst, Ext.targz])
def _zstd_open(path: Path, *args, **kwargs) -> IO:
import zstandard as zstd # type: ignore
fh = path.open('rb')
dctx = zstd.ZstdDecompressor()
reader = dctx.stream_reader(fh)
mode = kwargs.get('mode', 'rt')
if mode == 'rb':
return reader
else:
# must be text mode
kwargs.pop('mode') # TextIOWrapper doesn't like it
return io.TextIOWrapper(reader, **kwargs) # meh
# TODO use the 'dependent type' trick for return type?
def kopen(path: PathIsh, *args, mode: str='rt', **kwargs) -> IO:
# just in case, but I think this shouldn't be necessary anymore
# since when we call .read_text, encoding is passed already
if mode in {'r', 'rt'}:
encoding = kwargs.get('encoding', 'utf8')
else:
encoding = None
kwargs['encoding'] = encoding
pp = Path(path)
name = pp.name
if name.endswith(Ext.xz):
import lzma
# ugh. for lzma, 'r' means 'rb'
# https://github.com/python/cpython/blob/d01cf5072be5511595b6d0c35ace6c1b07716f8d/Lib/lzma.py#L97
# whereas for regular open, 'r' means 'rt'
# https://docs.python.org/3/library/functions.html#open
if mode == 'r':
mode = 'rt'
kwargs['mode'] = mode
return lzma.open(pp, *args, **kwargs)
elif name.endswith(Ext.zip):
# eh. this behaviour is a bit dodgy...
from zipfile import ZipFile
zfile = ZipFile(pp)
[subpath] = args # meh?
## oh god... https://stackoverflow.com/a/5639960/706389
ifile = zfile.open(subpath, mode='r')
ifile.readable = lambda: True # type: ignore
ifile.writable = lambda: False # type: ignore
ifile.seekable = lambda: False # type: ignore
ifile.read1 = ifile.read # type: ignore
# TODO pass all kwargs here??
# todo 'expected "BinaryIO"'??
return io.TextIOWrapper(ifile, encoding=encoding)
elif name.endswith(Ext.lz4):
import lz4.frame # type: ignore
return lz4.frame.open(str(pp), mode, *args, **kwargs)
elif name.endswith(Ext.zstd) or name.endswith(Ext.zst): # noqa: PIE810
kwargs['mode'] = mode
return _zstd_open(pp, *args, **kwargs)
elif name.endswith(Ext.targz):
import tarfile
# FIXME pass mode?
tf = tarfile.open(pp)
# TODO pass encoding?
x = tf.extractfile(*args); assert x is not None
return x
else:
return pp.open(mode, *args, **kwargs)
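# For example (archive names are placeholders), kopen dispatches purely on the
# file extension, so callers don't need to know the compression format:
#
#     kopen('export.json.xz').read()             # transparently decompressed text
#     kopen('takeout.zip', 'inner/data.json')    # member file inside the archive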
import os
import typing
if typing.TYPE_CHECKING:
# otherwise mypy can't figure out that BasePath is a type alias..
BasePath = pathlib.Path
else:
BasePath = pathlib.WindowsPath if os.name == 'nt' else pathlib.PosixPath
class CPath(BasePath):
"""
Hacky way to support compressed files.
If you can think of a better way to do this, please let me know! https://github.com/karlicoss/HPI/issues/20
Ugh. So, can't override Path because of some _flavour thing.
Path only has _accessor and _closed slots, so can't directly set .open method
_accessor.open has to return file descriptor, doesn't work for compressed stuff.
"""
def open(self, *args, **kwargs): # noqa: ARG002
kopen_kwargs = {}
mode = kwargs.get('mode')
if mode is not None:
kopen_kwargs['mode'] = mode
encoding = kwargs.get('encoding')
if encoding is not None:
kopen_kwargs['encoding'] = encoding
# TODO assert read only?
return kopen(str(self), **kopen_kwargs)
open = kopen # TODO deprecate
# meh
# TODO ideally switch to ZipPath or smth similar?
# nothing else supports subpath properly anyway
def kexists(path: PathIsh, subpath: str) -> bool:
try:
kopen(path, subpath)
except Exception:
return False
else:
return True
import zipfile
# meh... zipfile.Path is not available on 3.7
zipfile_Path = zipfile.Path
@total_ordering
class ZipPath(zipfile_Path):
# NOTE: is_dir/is_file might not behave as expected, the base class checks it only based on the slash in path
# seems that root/at are not exposed in the docs, so might be an implementation detail
root: zipfile.ZipFile # type: ignore[assignment]
at: str
@property
def filepath(self) -> Path:
res = self.root.filename
assert res is not None # make mypy happy
return Path(res)
@property
def subpath(self) -> Path:
return Path(self.at)
def absolute(self) -> ZipPath:
return ZipPath(self.filepath.absolute(), self.at)
def expanduser(self) -> ZipPath:
return ZipPath(self.filepath.expanduser(), self.at)
def exists(self) -> bool:
if self.at == '':
# special case, the base class returns False in this case for some reason
return self.filepath.exists()
return super().exists() or self._as_dir().exists()
def _as_dir(self) -> zipfile_Path:
# note: seems that zip always uses forward slash, regardless of OS?
return zipfile_Path(self.root, self.at + '/')
def rglob(self, glob: str) -> Iterator[ZipPath]:
# note: not 100% sure about the correctness, but seems fine?
# Path.match() matches from the right, so we pre-filter by path prefix first
rpaths = [p for p in self.root.namelist() if p.startswith(self.at)]
rpaths = [p for p in rpaths if Path(p).match(glob)]
return (ZipPath(self.root, p) for p in rpaths)
def relative_to(self, other: ZipPath) -> Path: # type: ignore[override, unused-ignore]
assert self.filepath == other.filepath, (self.filepath, other.filepath)
return self.subpath.relative_to(other.subpath)
@property
def parts(self) -> Sequence[str]:
# messy, but might be ok..
return self.filepath.parts + self.subpath.parts
def __truediv__(self, key) -> ZipPath:
# need to implement it so the return type is not zipfile.Path
tmp = zipfile_Path(self.root) / self.at / key
return ZipPath(self.root, tmp.at)
def iterdir(self) -> Iterator[ZipPath]:
for s in self._as_dir().iterdir():
yield ZipPath(s.root, s.at)
@property
def stem(self) -> str:
return self.subpath.stem
@property # type: ignore[misc]
def __class__(self):
return Path
def __eq__(self, other) -> bool:
# hmm, super class doesn't seem to treat as equals unless they are the same object
if not isinstance(other, ZipPath):
return False
return (self.filepath, self.subpath) == (other.filepath, other.subpath)
def __lt__(self, other) -> bool:
if not isinstance(other, ZipPath):
return False
return (self.filepath, self.subpath) < (other.filepath, other.subpath)
def __hash__(self) -> int:
return hash((self.filepath, self.subpath))
def stat(self) -> os.stat_result:
# NOTE: zip datetimes have no notion of time zone, usually they just keep local time?
# see https://en.wikipedia.org/wiki/ZIP_(file_format)#Structure
dt = datetime(*self.root.getinfo(self.at).date_time)
ts = int(dt.timestamp())
params = dict( # noqa: C408
st_mode=0,
st_ino=0,
st_dev=0,
st_nlink=1,
st_uid=1000,
st_gid=1000,
st_size=0, # todo compute it properly?
st_atime=ts,
st_mtime=ts,
st_ctime=ts,
)
return os.stat_result(tuple(params.values()))
@property
def suffix(self) -> str:
return Path(self.parts[-1]).suffix
# fmt: on
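# A quick sketch of ZipPath in action (archive name is a placeholder) -- it aims
# to behave like a regular Path rooted inside the archive:
#
#     zp = ZipPath('takeout.zip')
#     for p in zp.rglob('*.json'):
#         print(p.filepath, p.subpath, p.stat().st_mtime)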

163
my/core/cachew.py Normal file

@@ -0,0 +1,163 @@
from __future__ import annotations
from .internal import assert_subpackage
assert_subpackage(__name__)
import logging
import sys
from collections.abc import Iterator
from contextlib import contextmanager
from pathlib import Path
from typing import (
TYPE_CHECKING,
Any,
Callable,
TypeVar,
Union,
cast,
overload,
)
import appdirs # type: ignore[import-untyped]
from . import warnings
PathIsh = Union[str, Path] # avoid circular import from .common
def disable_cachew() -> None:
try:
import cachew # noqa: F401 # unused, it's fine
except ImportError:
# nothing to disable
return
from cachew import settings
settings.ENABLE = False
@contextmanager
def disabled_cachew() -> Iterator[None]:
try:
import cachew # noqa: F401 # unused, it's fine
except ImportError:
# nothing to disable
yield
return
from cachew.extra import disabled_cachew
with disabled_cachew():
yield
def _appdirs_cache_dir() -> Path:
cd = Path(appdirs.user_cache_dir('my'))
cd.mkdir(exist_ok=True, parents=True)
return cd
_CACHE_DIR_NONE_HACK = Path('/tmp/hpi/cachew_none_hack')
def cache_dir(suffix: PathIsh | None = None) -> Path:
from . import core_config as CC
cdir_ = CC.config.get_cache_dir()
sp: Path | None = None
if suffix is not None:
sp = Path(suffix)
# guess if you do need an absolute path, better to pass it directly instead of as a suffix?
assert not sp.is_absolute(), sp
# ok, so ideally we could just return cdir_ / sp
# however, this function was at first used without the suffix, e.g. cache_dir() / 'some_dir'
# but now cache_dir setting can also be None which means 'disable cache'
# changing return type to Optional means that it will break for existing users even if the cache isn't used
# it's kinda wrong.. so we use dummy path (_CACHE_DIR_NONE_HACK), and then strip it away in core.common.mcachew
# this logic is tested via test_cachew_dir_none
if cdir_ is None:
cdir = _CACHE_DIR_NONE_HACK
else:
cdir = cdir_
return cdir if sp is None else cdir / sp
"""See core.cachew.cache_dir for the explanation"""
_cache_path_dflt = cast(str, object())
# TODO I don't really like 'mcachew', just 'cache' would be better... maybe?
# todo ugh. I think it needs @doublewrap, otherwise @mcachew without args doesn't work
# but it's a bit problematic.. doublewrap works by detecting if the first arg is callable
# but here cache_path can also be a callable (for lazy/dynamic path)... so unclear how to detect this
def _mcachew_impl(cache_path=_cache_path_dflt, **kwargs):
"""
Stands for 'Maybe cachew'.
Defensive wrapper around @cachew to make it an optional dependency.
"""
if cache_path is _cache_path_dflt:
# wasn't specified... so we need to use cache_dir
cache_path = cache_dir()
if isinstance(cache_path, (str, Path)):
try:
# check that it starts with 'hack' path
Path(cache_path).relative_to(_CACHE_DIR_NONE_HACK)
except: # noqa: E722 bare except
pass # no action needed, doesn't start with 'hack' string
else:
# todo show warning? tbh unclear how to detect when the user stopped using the 'old' way and switched to suffix instead?
# if it does, means that user wanted to disable cache
cache_path = None
try:
import cachew
except ModuleNotFoundError:
warnings.high('cachew library not found. You might want to install it to speed things up. See https://github.com/karlicoss/cachew')
return lambda orig_func: orig_func
else:
kwargs['cache_path'] = cache_path
return cachew.cachew(**kwargs)
if TYPE_CHECKING:
R = TypeVar('R')
if sys.version_info[:2] >= (3, 10):
from typing import ParamSpec
else:
from typing_extensions import ParamSpec
P = ParamSpec('P')
CC = Callable[P, R] # need to give it a name, if inlined into bound=, mypy runs into a bug
PathProvider = Union[PathIsh, Callable[P, PathIsh]]
# NOTE: in cachew, HashFunction type returns str
# however in practice, cachew always calls str on its result
# so perhaps better to switch it to Any in cachew as well
HashFunction = Callable[P, Any]
F = TypeVar('F', bound=Callable)
# we need two versions due to @doublewrap
# this is when we just annotate as @cachew without any args
@overload # type: ignore[no-overload-impl]
def mcachew(fun: F) -> F: ...
@overload
def mcachew(
cache_path: PathProvider | None = ...,
*,
force_file: bool = ...,
cls: type | None = ...,
depends_on: HashFunction = ...,
logger: logging.Logger | None = ...,
chunk_by: int = ...,
synthetic_key: str | None = ...,
) -> Callable[[F], F]: ...
else:
mcachew = _mcachew_impl
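# Typical usage sketch (function names are illustrative): decorate a generator
# with mcachew so results are cached when cachew is installed, and the decorator
# silently becomes a no-op when it isn't:
#
#     @mcachew(depends_on=lambda: inputs())
#     def messages() -> Iterator[Message]:
#         for f in inputs():
#             yield from _parse(f)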

125
my/core/cfg.py Normal file

@@ -0,0 +1,125 @@
from __future__ import annotations
import importlib
import re
import sys
from collections.abc import Iterator
from contextlib import ExitStack, contextmanager
from typing import Any, Callable, TypeVar
Attrs = dict[str, Any]
C = TypeVar('C')
# todo not sure about it, could be overthinking...
# but short enough to change later
# TODO document why it's necessary?
def make_config(cls: type[C], migration: Callable[[Attrs], Attrs] = lambda x: x) -> C:
user_config = cls.__base__
old_props = {
# NOTE: deliberately use getattr to 'force' class properties here
k: getattr(user_config, k)
for k in vars(user_config)
}
new_props = migration(old_props)
from dataclasses import fields
params = {
k: v
for k, v in new_props.items()
if k in {f.name for f in fields(cls)} # type: ignore[arg-type] # see https://github.com/python/typing_extensions/issues/115
}
# todo maybe return type here?
return cls(**params)
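# Sketch of the typical pattern (names are illustrative): a module subclasses
# the user's config block, adding defaults/annotations, then combines the two:
#
#     from my.config import twitter as user_config
#
#     @dataclass
#     class twitter(user_config):
#         export_path: Paths
#
#     config = make_config(twitter)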
F = TypeVar('F')
@contextmanager
def _override_config(config: F) -> Iterator[F]:
'''
Temporary override for config's parameters, useful for testing/fake data/etc.
'''
orig_properties = {k: v for k, v in vars(config).items() if not k.startswith('__')}
try:
yield config
finally:
# ugh. __dict__ of type objects isn't writable..
for k, v in orig_properties.items():
setattr(config, k, v)
added = {k for k in set(vars(config).keys()).difference(set(orig_properties.keys())) if not k.startswith('__')}
for k in added:
delattr(config, k)
ModuleRegex = str
@contextmanager
def _reload_modules(modules: ModuleRegex) -> Iterator[None]:
# need to use list here, otherwise reordering with set might mess things up
def loaded_modules() -> list[str]:
return [name for name in sys.modules if re.fullmatch(modules, name)]
modules_before = loaded_modules()
# uhh... seems that reversed might make more sense -- not 100% sure why, but this works for tests/reddit.py
for m in reversed(modules_before):
# ugh... seems that reload works whereas pop doesn't work in some cases (e.g. on tests/reddit.py)
# sys.modules.pop(m, None)
importlib.reload(sys.modules[m])
try:
yield
finally:
modules_after = loaded_modules()
modules_before_set = set(modules_before)
for m in modules_after:
if m in modules_before_set:
# was previously loaded, so need to reload to pick up old config
importlib.reload(sys.modules[m])
else:
# wasn't previously loaded, so need to unload it
# otherwise it might fail due to missing config etc
sys.modules.pop(m, None)
@contextmanager
def tmp_config(*, modules: ModuleRegex | None = None, config=None):
if modules is None:
assert config is None
if modules is not None:
assert config is not None
import my.config
with ExitStack() as module_reload_stack, _override_config(my.config) as new_config:
if config is not None:
overrides = {k: v for k, v in vars(config).items() if not k.startswith('__')}
for k, v in overrides.items():
setattr(new_config, k, v)
if modules is not None:
module_reload_stack.enter_context(_reload_modules(modules))
yield new_config
def test_tmp_config() -> None:
class extra:
data_path = '/path/to/data'
with tmp_config() as c:
assert c.google != 'whatever'
assert not hasattr(c, 'extra')
c.extra = extra
c.google = 'whatever'
# todo hmm. not sure what should do about new properties??
assert not hasattr(c, 'extra')
assert c.google != 'whatever'
###
# todo properly deprecate, this isn't really meant for public use
override_config = _override_config

262
my/core/common.py Normal file

@@ -0,0 +1,262 @@
from __future__ import annotations
import os
from collections.abc import Iterable, Sequence
from glob import glob as do_glob
from pathlib import Path
from typing import (
TYPE_CHECKING,
Callable,
Generic,
TypeVar,
Union,
)
from . import compat, warnings
# some helper functions
# TODO start deprecating this? soon we'd be able to use Path | str syntax which is shorter and more explicit
PathIsh = Union[Path, str]
Paths = Union[Sequence[PathIsh], PathIsh]
DEFAULT_GLOB = '*'
def get_files(
pp: Paths,
glob: str = DEFAULT_GLOB,
*,
sort: bool = True,
guess_compression: bool = True,
) -> tuple[Path, ...]:
"""
Helper function to avoid boilerplate.
Tuple as return type is a bit friendlier for hashing/caching, so hopefully makes sense
"""
# TODO FIXME mm, some wrapper to assert iterator isn't empty?
sources: list[Path]
if isinstance(pp, Path):
sources = [pp]
elif isinstance(pp, str):
if pp == '':
# special case -- makes sense for optional data sources, etc
return () # early return to prevent warnings etc
sources = [Path(pp)]
else:
sources = [p if isinstance(p, Path) else Path(p) for p in pp]
def caller() -> str:
import traceback
# TODO ugh. very flaky... -3 because [<this function>, get_files(), <actual caller>]
return traceback.extract_stack()[-3].filename
paths: list[Path] = []
for src in sources:
if src.parts[0] == '~':
src = src.expanduser()
# note: glob handled first, because e.g. on Windows asterisk makes is_dir unhappy
gs = str(src)
if '*' in gs:
if glob != DEFAULT_GLOB:
warnings.medium(f"{caller()}: treating {gs} as glob path. Explicit glob={glob} argument is ignored!")
paths.extend(map(Path, do_glob(gs))) # noqa: PTH207
elif os.path.isdir(str(src)): # noqa: PTH112
# NOTE: we're using os.path here on purpose instead of src.is_dir
# the reason is that is_dir for archives might return True and then
# this clause would try globbing inside the archives
# this is generally undesirable (since modules handle archives themselves)
# todo not sure if should be recursive?
# note: glob='**/*.ext' works without any changes.. so perhaps it's ok as it is
gp: Iterable[Path] = src.glob(glob)
paths.extend(gp)
else:
assert src.exists(), src
# todo assert matches glob??
paths.append(src)
if sort:
paths = sorted(paths)
if len(paths) == 0:
# todo make it conditionally defensive based on some global settings
warnings.high(f'''
{caller()}: no paths were matched against {pp}. This might result in missing data. Likely, the directory you passed is empty.
'''.strip())
# traceback is useful to figure out what config caused it?
import traceback
traceback.print_stack()
if guess_compression:
from .kompress import CPath, ZipPath, is_compressed
# NOTE: wrap is just for backwards compat with vendorized kompress
# with the kompress library, only the is_compressed check and CPath should be enough
def wrap(p: Path) -> Path:
if isinstance(p, ZipPath):
return p
if p.suffix == '.zip':
return ZipPath(p) # type: ignore[return-value]
if is_compressed(p):
return CPath(p)
return p
paths = [wrap(p) for p in paths]
return tuple(paths)
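# e.g. a typical module call (sketch; the config attribute is illustrative):
#
#     paths = get_files(config.export_path, glob='*.json')
#
# this accepts a single path, a string or a sequence, expands ~, resolves globs,
# and wraps compressed files so downstream code can open everything uniformly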
_R = TypeVar('_R')
# https://stackoverflow.com/a/5192374/706389
# NOTE: it was added to stdlib in 3.9 and then deprecated in 3.11
# seems that the suggested solution is to use custom decorator?
class classproperty(Generic[_R]):
def __init__(self, f: Callable[..., _R]) -> None:
self.f = f
def __get__(self, obj, cls) -> _R:
return self.f(cls)
def test_classproperty() -> None:
from .compat import assert_type
class C:
@classproperty
def prop(cls) -> str:
return 'hello'
res = C.prop
assert_type(res, str)
assert res == 'hello'
# hmm, this doesn't really work with mypy well..
# https://github.com/python/mypy/issues/6244
# class staticproperty(Generic[_R]):
# def __init__(self, f: Callable[[], _R]) -> None:
# self.f = f
#
# def __get__(self) -> _R:
# return self.f()
import re
# https://stackoverflow.com/a/295466/706389
def get_valid_filename(s: str) -> str:
s = str(s).strip().replace(' ', '_')
return re.sub(r'(?u)[^-\w.]', '', s)
# TODO deprecate and suggest to use one from my.core directly? not sure
from .utils.itertools import unique_everseen # noqa: F401
### legacy imports, keeping them here for backwards compatibility
## hiding behind TYPE_CHECKING so it works in runtime
## in principle, warnings.deprecated decorator should cooperate with mypy, but doesn't look like it works atm?
## perhaps it doesn't work when it's used from typing_extensions
if not TYPE_CHECKING:
from .compat import deprecated
@deprecated('use my.core.compat.assert_never instead')
def assert_never(*args, **kwargs):
return compat.assert_never(*args, **kwargs)
@deprecated('use my.core.compat.fromisoformat instead')
def isoparse(*args, **kwargs):
return compat.fromisoformat(*args, **kwargs)
@deprecated('use more_itertools.one instead')
def the(*args, **kwargs):
import more_itertools
return more_itertools.one(*args, **kwargs)
@deprecated('use functools.cached_property instead')
def cproperty(*args, **kwargs):
import functools
return functools.cached_property(*args, **kwargs)
@deprecated('use more_itertools.bucket instead')
def group_by_key(l, key):
res = {}
for i in l:
kk = key(i)
lst = res.get(kk, [])
lst.append(i)
res[kk] = lst
return res
@deprecated('use my.core.utils.itertools.make_dict instead')
def make_dict(*args, **kwargs):
from .utils import itertools as UI
return UI.make_dict(*args, **kwargs)
@deprecated('use my.core.utils.itertools.listify instead')
def listify(*args, **kwargs):
from .utils import itertools as UI
return UI.listify(*args, **kwargs)
@deprecated('use my.core.warn_if_empty instead')
def warn_if_empty(*args, **kwargs):
from .utils import itertools as UI
return UI.warn_if_empty(*args, **kwargs)
@deprecated('use my.core.stat instead')
def stat(*args, **kwargs):
from . import stats
return stats.stat(*args, **kwargs)
@deprecated('use my.core.make_logger instead')
def LazyLogger(*args, **kwargs):
from . import logging
return logging.LazyLogger(*args, **kwargs)
@deprecated('use my.core.types.asdict instead')
def asdict(*args, **kwargs):
from . import types
return types.asdict(*args, **kwargs)
# todo wrap these in deprecated decorator as well?
# TODO hmm how to deprecate these in runtime?
# tricky cause they are actually classes/types
from typing import Literal # noqa: F401
from .cachew import mcachew # noqa: F401
# this is kinda internal, should just use my.core.logging.setup_logger if necessary
from .logging import setup_logger
from .stats import Stats
from .types import (
Json,
datetime_aware,
datetime_naive,
)
tzdatetime = datetime_aware
else:
from .compat import Never
# make these invalid during type check while still working at runtime
Stats = Never
tzdatetime = Never
Json = Never
datetime_naive = Never
datetime_aware = Never
###

139
my/core/compat.py Normal file

@@ -0,0 +1,139 @@
'''
Contains backwards compatibility helpers for different python versions.
If something is relevant to HPI itself, please put it in .hpi_compat instead
'''
from __future__ import annotations
import sys
from typing import TYPE_CHECKING
if sys.version_info[:2] >= (3, 13):
from warnings import deprecated
else:
from typing_extensions import deprecated
# keeping just for backwards compatibility, used to have compat implementation for 3.6
if not TYPE_CHECKING:
import sqlite3
@deprecated('use .backup method on sqlite3.Connection directly instead')
def sqlite_backup(*, source: sqlite3.Connection, dest: sqlite3.Connection, **kwargs) -> None:
# TODO warn here?
source.backup(dest, **kwargs)
# keeping for runtime backwards compatibility (added in 3.9)
@deprecated('use .removeprefix method on string directly instead')
def removeprefix(text: str, prefix: str) -> str:
return text.removeprefix(prefix)
@deprecated('use .removesuffix method on string directly instead')
def removesuffix(text: str, suffix: str) -> str:
return text.removesuffix(suffix)
##
## used to have compat function before 3.8 for these, keeping for runtime back compatibility
from functools import cached_property
from typing import Literal, Protocol, TypedDict
##
if sys.version_info[:2] >= (3, 10):
from typing import ParamSpec
else:
from typing_extensions import ParamSpec
# bisect_left doesn't have a 'key' parameter (which we use)
# till python3.10
if sys.version_info[:2] <= (3, 9):
from typing import Any, Callable, List, Optional, TypeVar # noqa: UP035
X = TypeVar('X')
# copied from python src
# fmt: off
def bisect_left(a: list[Any], x: Any, lo: int=0, hi: int | None=None, *, key: Callable[..., Any] | None=None) -> int:
if lo < 0:
raise ValueError('lo must be non-negative')
if hi is None:
hi = len(a)
# Note, the comparison uses "<" to match the
# __lt__() logic in list.sort() and in heapq.
if key is None:
while lo < hi:
mid = (lo + hi) // 2
if a[mid] < x:
lo = mid + 1
else:
hi = mid
else:
while lo < hi:
mid = (lo + hi) // 2
if key(a[mid]) < x:
lo = mid + 1
else:
hi = mid
return lo
# fmt: on
else:
from bisect import bisect_left
from datetime import datetime
if sys.version_info[:2] >= (3, 11):
fromisoformat = datetime.fromisoformat
else:
# fromisoformat didn't support Z as "utc" before 3.11
# https://docs.python.org/3/library/datetime.html#datetime.datetime.fromisoformat
def fromisoformat(date_string: str) -> datetime:
if date_string.endswith('Z'):
date_string = date_string[:-1] + '+00:00'
return datetime.fromisoformat(date_string)
def test_fromisoformat() -> None:
from datetime import timezone
# fmt: off
# feedbin has this format
assert fromisoformat('2020-05-01T10:32:02.925961Z') == datetime(
2020, 5, 1, 10, 32, 2, 925961, timezone.utc,
)
# polar has this format
assert fromisoformat('2018-11-28T22:04:01.304Z') == datetime(
2018, 11, 28, 22, 4, 1, 304000, timezone.utc,
)
# stackexchange, runnerup has this format
assert fromisoformat('2020-11-30T00:53:12Z') == datetime(
2020, 11, 30, 0, 53, 12, 0, timezone.utc,
)
# fmt: on
# arbtt has this format (sometimes less/more than 6 digits in milliseconds)
# TODO doesn't work atm, not sure if really should be supported...
# maybe should have flags for weird formats?
# assert isoparse('2017-07-18T18:59:38.21731Z') == datetime(
# 2017, 7, 18, 18, 59, 38, 217310, timezone.utc,
# )
if sys.version_info[:2] >= (3, 10):
from types import NoneType
from typing import TypeAlias
else:
NoneType = type(None)
from typing_extensions import TypeAlias
if sys.version_info[:2] >= (3, 11):
from typing import Never, assert_never, assert_type
else:
from typing_extensions import Never, assert_never, assert_type

173
my/core/core_config.py Normal file

@@ -0,0 +1,173 @@
'''
Bindings for the 'core' HPI configuration
'''
from __future__ import annotations
import re
from collections.abc import Sequence
from dataclasses import dataclass
from pathlib import Path
from . import warnings
try:
from my.config import core as user_config # type: ignore[attr-defined]
except Exception as e:
try:
from my.config import common as user_config # type: ignore[attr-defined]
warnings.high("'common' config section is deprecated. Please rename it to 'core'.")
except Exception as e2:
# make it defensive, because it's pretty commonly used and would be annoying if it breaks hpi doctor etc.
# this way it'll at least use the defaults
# todo actually not sure if needs a warning? Perhaps it's okay without it, because the defaults are reasonable enough
user_config = object
_HPI_CACHE_DIR_DEFAULT = ''
@dataclass
class Config(user_config):
'''
Config for the HPI itself.
To override, add to your config file something like
class config:
cache_dir = '/your/custom/cache/path'
'''
cache_dir: Path | str | None = _HPI_CACHE_DIR_DEFAULT
'''
Base directory for cachew.
- if None , means cache is disabled
- if '' (empty string), use user cache dir (see https://github.com/ActiveState/appdirs for more info). This is the default.
- otherwise , use the specified directory as base cache directory
NOTE: you shouldn't use this attribute in HPI modules directly, use Config.get_cache_dir()/cachew.cache_dir() instead
'''
tmp_dir: Path | str | None = None
'''
Path to a temporary directory.
This can be used temporarily while extracting zipfiles etc...
- if None , uses default determined by tempfile.gettempdir + 'HPI'
- otherwise , use the specified directory as the base temporary directory
'''
enabled_modules: Sequence[str] | None = None
'''
list of regexes/globs
- None means 'rely on disabled_modules'
'''
disabled_modules: Sequence[str] | None = None
'''
list of regexes/globs
- None means 'rely on enabled_modules'
'''
def get_cache_dir(self) -> Path | None:
cdir = self.cache_dir
if cdir is None:
return None
if cdir == _HPI_CACHE_DIR_DEFAULT:
from .cachew import _appdirs_cache_dir
return _appdirs_cache_dir()
else:
return Path(cdir).expanduser()
def get_tmp_dir(self) -> Path:
tdir: Path | str | None = self.tmp_dir
tpath: Path
# use tempfile if unset
if tdir is None:
import tempfile
tpath = Path(tempfile.gettempdir()) / 'HPI'
else:
tpath = Path(tdir)
tpath = tpath.expanduser()
tpath.mkdir(parents=True, exist_ok=True)
return tpath
def _is_module_active(self, module: str) -> bool | None:
# None means the config doesn't specify anything
# todo might be nice to return the 'reason' too? e.g. which option has matched
def matches(specs: Sequence[str]) -> str | None:
for spec in specs:
# not sure because . (package separator) matches anything in a regex, but I guess it's unlikely to clash
if re.match(spec, module):
return spec
return None
on = matches(self.enabled_modules or [])
off = matches(self.disabled_modules or [])
if on is None:
if off is None:
# user is indifferent
return None
else:
return False
else: # not None
if off is None:
return True
else: # not None
# fall back onto 'enable everything', then the user will notice
warnings.medium(f"[module]: conflicting regexes '{on}' and '{off}' are set in the config. Please only use one of them.")
return True
from .cfg import make_config
config = make_config(Config)
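# For example, to control module discovery from your own config (sketch):
#
#     class core:
#         enabled_modules  = ['my.github.*']
#         disabled_modules = ['my.body.*']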
### tests start
from collections.abc import Iterator
from contextlib import contextmanager as ctx
@ctx
def _reset_config() -> Iterator[Config]:
# todo maybe have this decorator for the whole of my.config?
from .cfg import _override_config
with _override_config(config) as cc:
cc.enabled_modules = None
cc.disabled_modules = None
cc.cache_dir = None
yield cc
def test_active_modules() -> None:
import pytest
reset = _reset_config
with reset() as cc:
assert cc._is_module_active('my.whatever' ) is None
assert cc._is_module_active('my.core' ) is None
assert cc._is_module_active('my.body.exercise') is None
with reset() as cc:
cc.enabled_modules = ['my.whatever']
cc.disabled_modules = ['my.body.*']
assert cc._is_module_active('my.whatever' ) is True
assert cc._is_module_active('my.core' ) is None
assert cc._is_module_active('my.body.exercise') is False
with reset() as cc:
# if both are set, enable all
cc.disabled_modules = ['my.body.*']
cc.enabled_modules = ['my.body.exercise']
assert cc._is_module_active('my.whatever' ) is None
assert cc._is_module_active('my.core' ) is None
with pytest.warns(UserWarning, match=r"conflicting regexes") as record_warnings:
assert cc._is_module_active("my.body.exercise") is True
assert len(record_warnings) == 1
### tests end

5
my/core/dataset.py Normal file

@@ -0,0 +1,5 @@
from . import warnings
warnings.high(f"{__name__} is deprecated, please use dataset directly if you need or switch to my.core.sqlite")
from ._deprecated.dataset import *

179
my/core/denylist.py Normal file

@@ -0,0 +1,179 @@
"""
A helper module for defining denylists for sources programmatically
(in layman's terms, this lets you remove output you don't want from a module)
For docs, see doc/DENYLIST.md
"""
from __future__ import annotations
import functools
import json
import sys
from collections import defaultdict
from collections.abc import Iterator, Mapping
from pathlib import Path
from typing import Any, TypeVar
import click
from more_itertools import seekable
from .serialize import dumps
from .warnings import medium
T = TypeVar("T")
DenyMap = Mapping[str, set[Any]]
def _default_key_func(obj: T) -> str:
return str(obj)
class DenyList:
def __init__(self, denylist_file: Path | str) -> None:
self.file = Path(denylist_file).expanduser().absolute()
self._deny_raw_list: list[dict[str, Any]] = []
self._deny_map: DenyMap = defaultdict(set)
# deny cli, user can override these
self.fzf_path = None
self._fzf_options = ()
self._deny_cli_key_func = None
def _load(self) -> None:
if not self.file.exists():
medium(f"denylist file {self.file} does not exist")
return
deny_map: DenyMap = defaultdict(set)
data: list[dict[str, Any]] = json.loads(self.file.read_text())
self._deny_raw_list = data
for ignore in data:
for k, v in ignore.items():
deny_map[k].add(v)
self._deny_map = deny_map
def load(self) -> DenyMap:
self._load()
return self._deny_map
def write(self) -> None:
if not self._deny_raw_list:
medium("no denylist data to write")
return
self.file.write_text(json.dumps(self._deny_raw_list))
@classmethod
def _is_json_primitive(cls, val: Any) -> bool:
return isinstance(val, (str, int, float, bool, type(None)))
@classmethod
def _stringify_value(cls, val: Any) -> Any:
# if it's a primitive, just return it
if cls._is_json_primitive(val):
return val
# otherwise, stringify-and-back so we can compare to
# json data loaded from the denylist file
return json.loads(dumps(val))
@classmethod
def _allow(cls, obj: T, deny_map: DenyMap) -> bool:
for deny_key, deny_set in deny_map.items():
# this should be done separately and not as part of the getattr
# because 'null'/None could actually be a value in the denylist,
# and the user may define behavior to filter that out
if not hasattr(obj, deny_key):
return False
val = cls._stringify_value(getattr(obj, deny_key))
# the attribute's value matches an entry in the denylist, so deny this object
if val in deny_set:
return False
# if we tried all the denylist keys and didn't return False,
# then this object is allowed
return True
def filter(
self,
itr: Iterator[T],
*,
invert: bool = False,
) -> Iterator[T]:
denyf = functools.partial(self._allow, deny_map=self.load())
if invert:
return filter(lambda x: not denyf(x), itr)
return filter(denyf, itr)
def deny(self, key: str, value: Any, *, write: bool = False) -> None:
'''
add a key/value pair to the denylist
'''
if not self._deny_raw_list:
self._load()
self._deny_raw({key: self._stringify_value(value)}, write=write)
def _deny_raw(self, data: dict[str, Any], *, write: bool = False) -> None:
self._deny_raw_list.append(data)
if write:
self.write()
def _prompt_keys(self, item: T) -> str:
import pprint
click.echo(pprint.pformat(item))
# TODO: extract keys from item by checking if its dataclass/NT etc.?
resp = click.prompt("Key to deny on").strip()
if not hasattr(item, resp):
click.echo(f"Could not find key '{resp}' on item", err=True)
return self._prompt_keys(item)
return resp
def _deny_cli_remember(
self,
items: Iterator[T],
mem: dict[str, T],
) -> Iterator[str]:
keyf = self._deny_cli_key_func or _default_key_func
# i.e., convert each item to a string, and map str -> item
for item in items:
key = keyf(item)
mem[key] = item
yield key
def deny_cli(self, itr: Iterator[T]) -> None:
try:
from pyfzf import FzfPrompt
except ImportError:
click.echo("pyfzf is required to use the denylist cli, run 'python3 -m pip install pyfzf_iter'", err=True)
sys.exit(1)
# wrap in seekable so we can use it multiple times
# progressively caches the items as we iterate over them
sit = seekable(itr)
prompt_continue = True
while prompt_continue:
# reset the iterator
sit.seek(0)
# so we can map the selected string from fzf back to the original objects
memory_map: dict[str, T] = {}
picker = FzfPrompt(executable_path=self.fzf_path, default_options="--no-multi")
picked_l = picker.prompt(
self._deny_cli_remember(sit, memory_map),  # iterate the seekable wrapper, not the raw (possibly exhausted) iterator
"--read0",
*self._fzf_options,
delimiter="\0",
)
assert isinstance(picked_l, list)
if picked_l:
picked: T = memory_map[picked_l[0]]
key = self._prompt_keys(picked)
self.deny(key, getattr(picked, key), write=True)
click.echo(f"Added {self._deny_raw_list[-1]} to denylist", err=True)
else:
click.echo("No item selected", err=True)
prompt_continue = click.confirm("Continue?")
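# Usage sketch (module and key names are illustrative): wrap a source's output
# in .filter(), and deny individual values by key:
#
#     deny = DenyList('~/.config/my/denylist.json')
#
#     def ips():
#         yield from deny.filter(_raw_ips())
#
#     deny.deny(key='addr', value='8.8.8.8', write=True)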

267
my/core/discovery_pure.py Normal file

@@ -0,0 +1,267 @@
'''
The idea of this module is to avoid imports of external HPI modules and code evaluation, relying on the ast module instead.
This potentially allows it to be:
- robust: can discover modules that can't be imported, generally makes it foolproof
- faster: importing is slow and with tens of modules can be noticeable
- secure: can be executed in a sandbox & used during setup
It should be free of external modules, importlib, exec, etc. etc.
'''
from __future__ import annotations
REQUIRES = 'REQUIRES'
NOT_HPI_MODULE_VAR = '__NOT_HPI_MODULE__'
###
import ast
import logging
import os
import re
from collections.abc import Iterable, Sequence
from pathlib import Path
from typing import Any, NamedTuple, Optional, cast
'''
None means that requirements weren't defined (different from empty requirements)
'''
Requires = Optional[Sequence[str]]
class HPIModule(NamedTuple):
name: str
skip_reason: str | None
doc: str | None = None
file: Path | None = None
requires: Requires = None
legacy: str | None = None # contains reason/deprecation warning
def ignored(m: str) -> bool:
excluded = [
# legacy stuff left for backwards compatibility
'core.*',
'config.*',
]
exs = '|'.join(excluded)
return re.match(f'^my.({exs})$', m) is not None
def has_stats(src: Path) -> bool:
# todo make sure consistent with get_stats?
return _has_stats(src.read_text())
def _has_stats(code: str) -> bool:
a: ast.Module = ast.parse(code)
for x in a.body:
try: # maybe assign
[tg] = cast(Any, x).targets
if tg.id == 'stats':
return True
except:
pass
try: # maybe def?
name = cast(Any, x).name
if name == 'stats':
return True
except:
pass
return False
def _is_not_module_src(src: Path) -> bool:
a: ast.Module = ast.parse(src.read_text())
return _is_not_module_ast(a)
def _is_not_module_ast(a: ast.Module) -> bool:
marker = NOT_HPI_MODULE_VAR
return any(
getattr(node, 'name', None) == marker # direct definition
or any(getattr(n, 'name', None) == marker for n in getattr(node, 'names', [])) # import from
for node in a.body
)
def _is_legacy_module(a: ast.Module) -> bool:
marker = 'handle_legacy_import'
return any(
getattr(node, 'name', None) == marker # direct definition
or any(getattr(n, 'name', None) == marker for n in getattr(node, 'names', [])) # import from
for node in a.body
)
# todo should be defensive? not sure
def _extract_requirements(a: ast.Module) -> Requires:
# find the assignment..
for x in a.body:
if not isinstance(x, ast.Assign):
continue
tg = x.targets
if len(tg) != 1:
continue
t = tg[0]
# could be Subscript.. so best to keep dynamic
id_ = getattr(t, 'id', None)
if id_ != REQUIRES:
continue
vals = x.value
# could be List/Tuple/Set?
elts = getattr(vals, 'elts', None)
if elts is None:
continue
deps = []
for c in elts:
if isinstance(c, ast.Constant):
deps.append(c.value)
elif isinstance(c, ast.Str):
deps.append(c.s)
else:
raise RuntimeError(f"Expecting string constants only in {REQUIRES} declaration")
return tuple(deps)
return None
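# e.g. a module declaring its dependencies at the top level, which the parser
# above picks up without importing anything (sketch of the convention):
#
#     REQUIRES = ['git+https://github.com/karlicoss/ghexport']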
# todo should probably be more defensive..
def all_modules() -> Iterable[HPIModule]:
"""
Return all importable modules under all items in the 'my' namespace package
Note: This returns all modules under all roots - if you have
several overlays (multiple items in my.__path__ and you've overridden
modules), this can return multiple HPIModule objects with the same
name. It should respect import order, as we're traversing
in my.__path__ order, so module_by_name should still work
and return the correctly resolved module, but all_modules
can have duplicates
"""
for my_root in _iter_my_roots():
yield from _modules_under_root(my_root)
def _iter_my_roots() -> Iterable[Path]:
import my # doesn't import any code, because of namespace package
paths: list[str] = list(my.__path__)
if len(paths) == 0:
# should probably never happen? if this code is running, it was imported
# because something was added to __path__ to match this name
raise RuntimeError("my.__path__ was empty, try re-installing HPI?")
else:
yield from map(Path, paths)
def _modules_under_root(my_root: Path) -> Iterable[HPIModule]:
"""
Experimental version, which isn't importing the modules, making it more robust and safe.
"""
for f in sorted(my_root.rglob('*.py')):
if f.is_symlink():
continue # meh
mp = f.relative_to(my_root.parent)
if mp.name == '__init__.py':
mp = mp.parent
m = str(mp.with_suffix('')).replace(os.sep, '.')
if ignored(m):
continue
a: ast.Module = ast.parse(f.read_text())
# legacy modules are 'forced' to be modules so 'hpi module install' still works for older modules
# a bit messy, will think how to fix it properly later
legacy_module = _is_legacy_module(a)
if _is_not_module_ast(a) and not legacy_module:
continue
doc = ast.get_docstring(a, clean=False)
requires: Requires = None
try:
requires = _extract_requirements(a)
except Exception as e:
logging.exception(e)
legacy = f'{m} is DEPRECATED. Please refer to the module documentation.' if legacy_module else None
yield HPIModule(
name=m,
skip_reason=None,
doc=doc,
file=f.relative_to(my_root.parent),
requires=requires,
legacy=legacy,
)
def module_by_name(name: str) -> HPIModule:
for m in all_modules():
if m.name == name:
return m
raise RuntimeError(f'No such module: {name}')
### tests
def test() -> None:
# TODO this should be a 'sanity check' or something
assert len(list(all_modules())) > 10 # kinda arbitrary
def test_demo() -> None:
demo = module_by_name('my.demo')
assert demo.doc is not None
assert demo.file == Path('my', 'demo.py')
assert demo.requires is None
def test_excluded() -> None:
for m in all_modules():
assert 'my.core.' not in m.name
def test_requires() -> None:
photos = module_by_name('my.photos.main')
r = photos.requires
assert r is not None
assert len(r) == 2 # fragile, but ok for now
def test_legacy_modules() -> None:
# shouldn't crash
module_by_name('my.reddit')
module_by_name('my.fbmessenger')
def test_pure() -> None:
"""
We want to keep this module clean of other HPI imports
"""
# this uses string concatenation here to prevent
# these tests from testing against themselves
src = Path(__file__).read_text()
# 'import my' is allowed, but
# don't allow any other HPI modules
assert re.findall('import ' + r'my\.\S+', src, re.MULTILINE) == []
assert 'from ' + 'my' not in src
def test_has_stats() -> None:
assert not _has_stats('')
assert not _has_stats('x = lambda : whatever')
assert _has_stats('''
def stats():
pass
''')
assert _has_stats('''
stats = lambda: "something"
''')
assert _has_stats('''
stats = other_function
''')

284
my/core/error.py Normal file

@@ -0,0 +1,284 @@
"""
Various error handling helpers
See https://beepb00p.xyz/mypy-error-handling.html#kiss for more detail
"""
from __future__ import annotations
import traceback
from collections.abc import Iterable, Iterator
from datetime import datetime
from itertools import tee
from typing import (
Any,
Callable,
Literal,
TypeVar,
Union,
cast,
)
from .types import Json
T = TypeVar('T')
E = TypeVar('E', bound=Exception) # TODO make covariant?
ResT = Union[T, E]
Res = ResT[T, Exception]
ErrorPolicy = Literal["yield", "raise", "drop"]
def notnone(x: T | None) -> T:
assert x is not None
return x
def unwrap(res: Res[T]) -> T:
if isinstance(res, Exception):
raise res
return res
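# The pattern this enables (sketch; inputs/parse are hypothetical helpers):
# modules yield Res[T] instead of raising, so consumers choose the error policy:
#
#     def items() -> Iterator[Res[Item]]:
#         for f in inputs():
#             try:
#                 yield parse(f)
#             except Exception as e:
#                 yield e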
def drop_exceptions(itr: Iterator[Res[T]]) -> Iterator[T]:
"""Return non-errors from the iterable"""
for o in itr:
if isinstance(o, Exception):
continue
yield o
def raise_exceptions(itr: Iterable[Res[T]]) -> Iterator[T]:
"""Raise errors from the iterable, stops the select function"""
for o in itr:
if isinstance(o, Exception):
raise o
yield o
def warn_exceptions(itr: Iterable[Res[T]], warn_func: Callable[[Exception], None] | None = None) -> Iterator[T]:
# if not provided, use the 'warnings' module
if warn_func is None:
from my.core.warnings import medium
def _warn_func(e: Exception) -> None:
# TODO: print traceback? but user could always --raise-exceptions as well
medium(str(e))
warn_func = _warn_func
for o in itr:
if isinstance(o, Exception):
warn_func(o)
continue
yield o
def echain(ex: E, cause: Exception) -> E:
ex.__cause__ = cause
return ex
def split_errors(l: Iterable[ResT[T, E]], ET: type[E]) -> tuple[Iterable[T], Iterable[E]]:
# TODO would be nice to have ET=Exception default? but it causes some mypy complaints?
vit, eit = tee(l)
# TODO ugh, not sure if I can reconcile type checking and runtime and convince mypy that ET and E are the same type?
values: Iterable[T] = (
r # type: ignore[misc]
for r in vit
if not isinstance(r, ET))
errors: Iterable[E] = (
r
for r in eit
if isinstance(r, ET))
# TODO would be interesting to be able to have yield statements anywhere in code
# so there are multiple 'entry points' to the return value
return (values, errors)
K = TypeVar('K')
def sort_res_by(items: Iterable[Res[T]], key: Callable[[Any], K]) -> list[Res[T]]:
"""
Sort a sequence potentially interleaved with errors/entries on which the key can't be computed.
The general idea is: the error sticks to the non-error entry that follows it
"""
group = []
groups = []
for i in items:
k: K | None
try:
k = key(i)
except Exception: # error while computing key? dunno, might be nice to handle...
k = None
group.append(i)
if k is not None:
groups.append((k, group))
group = []
results: list[Res[T]] = []
for _v, grp in sorted(groups, key=lambda p: p[0]): # type: ignore[return-value, arg-type] # TODO SupportsLessThan??
results.extend(grp)
results.extend(group) # handle last group (it will always be errors only)
return results
def test_sort_res_by() -> None:
class Exc(Exception):
def __eq__(self, other):
return self.args == other.args
ress = [
Exc('first'),
Exc('second'),
5,
3,
'bad',
2,
1,
Exc('last'),
]
results = sort_res_by(ress, lambda x: int(x))
assert results == [
1,
'bad',
2,
3,
Exc('first'),
Exc('second'),
5,
Exc('last'),
]
results2 = sort_res_by([*ress, 0], lambda x: int(x))
assert results2 == [Exc('last'), 0] + results[:-1]
assert sort_res_by(['caba', 'a', 'aba', 'daba'], key=lambda x: len(x)) == ['a', 'aba', 'caba', 'daba']
assert sort_res_by([], key=lambda x: x) == []
# helpers to associate timestamps with the errors (so something meaningful could be displayed on the plots, for example)
# todo document it under 'patterns' somewhere...
# todo proper typevar?
def set_error_datetime(e: Exception, dt: datetime | None) -> None:
if dt is None:
return
e.args = (*e.args, dt)
# todo not sure if should return new exception?
def attach_dt(e: Exception, *, dt: datetime | None) -> Exception:
set_error_datetime(e, dt)
return e
# todo it might be problematic because might mess with timezones (when it's converted to string, it's converted to a shift)
def extract_error_datetime(e: Exception) -> datetime | None:
import re
for x in reversed(e.args):
if isinstance(x, datetime):
return x
if not isinstance(x, str):
continue
m = re.search(r'\d{4}-\d\d-\d\d(...:..:..)?(\.\d{6})?(\+.....)?', x)
if m is None:
continue
ss = m.group(0)
# todo not sure if should be defensive??
return datetime.fromisoformat(ss)
return None
def error_to_json(e: Exception) -> Json:
estr = ''.join(traceback.format_exception(Exception, e, e.__traceback__))
return {'error': estr}
MODULE_SETUP_URL = 'https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#private-configuration-myconfig'
def warn_my_config_import_error(
err: ImportError | AttributeError,
*,
help_url: str | None = None,
module_name: str | None = None,
) -> bool:
"""
If the user tried to import something from my.config but it failed,
possibly due to missing the config block in my.config?
Returns True if it matched a possible config error
"""
import re
import click
if help_url is None:
help_url = MODULE_SETUP_URL
if type(err) is ImportError:
if err.name != 'my.config':
return False
# parse name that user attempted to import
em = re.match(r"cannot import name '(\w+)' from 'my.config'", str(err))
if em is not None:
section_name = em.group(1)
click.secho(f"""\
You may be missing the '{section_name}' section from your config.
See {help_url}\
""", fg='yellow', err=True)
return True
elif type(err) is AttributeError:
# test if user had a nested config block missing
# https://github.com/karlicoss/HPI/issues/223
if hasattr(err, 'obj') and hasattr(err, "name"):
config_obj = cast(object, getattr(err, 'obj')) # the object that caused the attribute error
# e.g. active_browser for my.browser
nested_block_name = err.name
errmsg = f"""You're likely missing the nested config block for '{getattr(config_obj, '__name__', str(config_obj))}.{nested_block_name}'.
See {help_url} or check the corresponding module.py file for an example\
"""
if config_obj.__module__ == 'my.config':
click.secho(errmsg, fg='yellow', err=True)
return True
if module_name is not None and nested_block_name == module_name.split('.')[-1]:
# this tries to cover cases like these
# user config:
# class location:
# class via_ip:
# accuracy = 10_000
# then when we import it, we do something like
# from my.config import location
# user_config = location.via_ip
# so if location is present, but via_ip is not, we get
# AttributeError: type object 'location' has no attribute 'via_ip'
click.secho(errmsg, fg='yellow', err=True)
return True
else:
click.echo(f"Unexpected error... {err}", err=True)
return False
def test_datetime_errors() -> None:
import pytz # noqa: I001
dt_notz = datetime.now()
dt_tz = datetime.now(tz=pytz.timezone('Europe/Amsterdam'))
for dt in [dt_tz, dt_notz]:
e1 = RuntimeError('whatever')
assert extract_error_datetime(e1) is None
set_error_datetime(e1, dt=dt)
assert extract_error_datetime(e1) == dt
e2 = RuntimeError(f'something something {dt} something else')
assert extract_error_datetime(e2) == dt
e3 = RuntimeError(str(['one', '2019-11-27T08:56:00', 'three']))
assert extract_error_datetime(e3) is not None
# date only
e4 = RuntimeError(str(['one', '2019-11-27', 'three']))
assert extract_error_datetime(e4) is not None

66
my/core/experimental.py Normal file

@@ -0,0 +1,66 @@
from __future__ import annotations
import sys
import types
from typing import Any
# The idea behind this one is to support accessing "overlaid/shadowed" modules from namespace packages
# See usage examples here:
# - https://github.com/karlicoss/hpi-personal-overlay/blob/master/src/my/util/hpi_heartbeat.py
# - https://github.com/karlicoss/hpi-personal-overlay/blob/master/src/my/twitter/all.py
# Suppose you want to use my.twitter.talon, which isn't in the default all.py
# You could just copy all.py to your personal overlay, but that would mean duplicating
# all the code and possible upstream changes.
# Alternatively, you could import the "original" my.twitter.all module from "overlay" my.twitter.all
# _ORIG = import_original_module(__name__, __file__)
# this would magically take care of package import path etc,
# and should import the "original" my.twitter.all as _ORIG
# After that you can call its methods, extend etc.
def import_original_module(
module_name: str,
file: str,
*,
star: bool = False,
globals: dict[str, Any] | None = None,
) -> types.ModuleType:
module_to_restore = sys.modules[module_name]
# NOTE: we really want to hack the actual package of the module
# rather than just top level my.
# since that would be a bit less disruptive
module_pkg = module_to_restore.__package__
assert module_pkg is not None
parent = sys.modules[module_pkg]
my_path = parent.__path__._path # type: ignore[attr-defined]
my_path_orig = list(my_path)
def fixup_path() -> None:
for i, p in enumerate(my_path_orig):
starts = file.startswith(p)
if i == 0:
# not sure about this.. but I guess it'll always be 0th element?
assert starts, (my_path_orig, file)
if starts:
my_path.remove(p)
# should remove exactly one item
assert len(my_path) + 1 == len(my_path_orig), (my_path_orig, file)
try:
fixup_path()
try:
del sys.modules[module_name]
# NOTE: we're using __import__ instead of importlib.import_module
# since it's closer to the actual normal import (e.g. imports subpackages etc properly)
# fromlist=[None] forces it to return rightmost child
# (otherwise would just return 'my' package)
res = __import__(module_name, fromlist=[None]) # type: ignore[list-item]
if star:
assert globals is not None
globals.update({k: v for k, v in vars(res).items() if not k.startswith('_')})
return res
finally:
sys.modules[module_name] = module_to_restore
finally:
my_path[:] = my_path_orig
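For reference, a minimal usage sketch of the pattern described in the comments above; the overlay layout and the talon source are hypothetical here:
# hypothetical my/twitter/all.py inside a personal overlay package
from my.core.experimental import import_original_module

# injects the original module's public names into our globals (star=True),
# so the overlay keeps behaving like upstream my.twitter.all by default
_ORIG = import_original_module(__name__, __file__, star=True, globals=globals())

from . import talon  # hypothetical extra source we want on top of the defaults

def tweets():
    # merge upstream data with the extra source (merge strategy is up to the overlay)
    yield from _ORIG.tweets()
    yield from talon.tweets()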

my/core/freezer.py Normal file

@@ -0,0 +1,82 @@
from __future__ import annotations
from .internal import assert_subpackage
assert_subpackage(__name__)
import dataclasses
import inspect
from typing import Any, Generic, TypeVar
D = TypeVar('D')
def _freeze_dataclass(Orig: type[D]):
ofields = [(f.name, f.type, f) for f in dataclasses.fields(Orig)] # type: ignore[arg-type] # see https://github.com/python/typing_extensions/issues/115
# extract properties along with their types
props = list(inspect.getmembers(Orig, lambda o: isinstance(o, property)))
pfields = [(name, inspect.signature(getattr(prop, 'fget')).return_annotation) for name, prop in props]
# FIXME not sure about name?
# NOTE: sadly passing bases=[Orig] won't work, python won't let us override properties with fields
RRR = dataclasses.make_dataclass('RRR', fields=[*ofields, *pfields])
# todo maybe even declare as slots?
return props, RRR
class Freezer(Generic[D]):
'''
Some magic which converts dataclass properties into fields.
It could be useful for better serialization, for performance, for using type as a schema.
For now only supports dataclasses.
'''
def __init__(self, Orig: type[D]) -> None:
self.Orig = Orig
self.props, self.Frozen = _freeze_dataclass(Orig)
def freeze(self, value: D) -> D:
pvalues = {name: getattr(value, name) for name, _ in self.props}
return self.Frozen(**dataclasses.asdict(value), **pvalues) # type: ignore[call-overload] # see https://github.com/python/typing_extensions/issues/115
### tests
# this needs to be defined here to prevent a mypy bug
# see https://github.com/python/mypy/issues/7281
@dataclasses.dataclass
class _A:
x: Any
# TODO what about error handling?
@property
def typed(self) -> int:
return self.x['an_int']
@property
def untyped(self):
return self.x['an_any']
def test_freezer() -> None:
val = _A(x={
'an_int': 123,
'an_any': [1, 2, 3],
})
af = Freezer(_A)
fval = af.freeze(val)
fd = vars(fval)
assert fd['typed'] == 123
assert fd['untyped'] == [1, 2, 3]
###
# TODO what to do with exceptions?
# e.g. a good testcase is a date parsing issue -- should definitely yield Exception in this case
# fundamentally it should just be Exception aware, dunno
#
# TODO not entirely sure if best to use Frozen as the schema, or actually convert objects..
# guess need to experiment and see
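For reference, a minimal sketch of the intended use; the Temperature dataclass below is made up for illustration:
import dataclasses
import json

@dataclasses.dataclass
class Temperature:
    celsius: float

    @property
    def fahrenheit(self) -> float:
        return self.celsius * 9 / 5 + 32

freezer = Freezer(Temperature)
frozen = freezer.freeze(Temperature(celsius=21.0))
# the property became an ordinary field, so generic serializers can see it
print(json.dumps(dataclasses.asdict(frozen)))  # {"celsius": 21.0, "fahrenheit": 69.8}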

my/core/hpi_compat.py Normal file

@@ -0,0 +1,260 @@
"""
Contains various backwards compatibility/deprecation helpers relevant to HPI itself.
(as opposed to .compat module which implements compatibility between python versions)
"""
from __future__ import annotations
import inspect
import os
import re
from collections.abc import Iterator, Sequence
from types import ModuleType
from typing import TypeVar
from . import warnings
def handle_legacy_import(
parent_module_name: str,
legacy_submodule_name: str,
parent_module_path: list[str],
) -> bool:
###
# this is to trick mypy into treating this as a proper namespace package
# should only be used for backwards compatibility on packages that are converted into the namespace & all.py pattern
# - https://www.python.org/dev/peps/pep-0382/#namespace-packages-today
# - https://github.com/karlicoss/hpi_namespace_experiment
# - discussion here https://memex.zulipchat.com/#narrow/stream/279601-hpi/topic/extending.20HPI/near/269946944
from pkgutil import extend_path
parent_module_path[:] = extend_path(parent_module_path, parent_module_name)
# 'this' source tree ends up first in the pythonpath when we extend_path()
# so we need to move 'this' source tree towards the end to make sure we prioritize overlays
parent_module_path[:] = parent_module_path[1:] + parent_module_path[:1]
###
# allow stuff like 'import my.module.submodule' and such
imported_as_parent = False
# allow stuff like 'from my.module import submodule'
importing_submodule = False
# some hacky traceback to inspect the current stack
# to see if the user is using the old style of importing
for f in inspect.stack():
# seems that when a submodule is imported, at some point it'll call some internal import machinery
# with 'parent' set to the parent module
# if parent module is imported first (i.e. in case of deprecated usage), it won't be the case
args = inspect.getargvalues(f.frame)
if args.locals.get('parent') == parent_module_name:
imported_as_parent = True
# this we can only detect from the code I guess
line = '\n'.join(f.code_context or [])
if re.match(rf'from\s+{parent_module_name}\s+import\s+{legacy_submodule_name}', line):
importing_submodule = True
# click sets '_HPI_COMPLETE' env var when it's doing autocompletion
# otherwise, the warning will be printed every time you try to tab complete
autocompleting_module_cli = "_HPI_COMPLETE" in os.environ
is_legacy_import = not (imported_as_parent or importing_submodule)
if is_legacy_import and not autocompleting_module_cli:
warnings.high(
f'''\
importing {parent_module_name} is DEPRECATED! \
Instead, import from {parent_module_name}.{legacy_submodule_name} or {parent_module_name}.all \
See https://github.com/karlicoss/HPI/blob/master/doc/MODULE_DESIGN.org#allpy for more info.
'''
)
return is_legacy_import
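# For illustration, a hedged sketch of the expected call site (module names are hypothetical):
# a my/foo/__init__.py converted to the namespace & all.py pattern could do
#
#     from my.core.hpi_compat import handle_legacy_import
#     if handle_legacy_import(
#         parent_module_name=__name__,
#         legacy_submodule_name='export',
#         parent_module_path=__path__,
#     ):
#         from .export import *  # keep old 'import my.foo' usage working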
def pre_pip_dal_handler(
name: str,
e: ModuleNotFoundError,
cfg,
requires: Sequence[str] = (),
) -> ModuleType:
'''
https://github.com/karlicoss/HPI/issues/79
'''
if e.name != name:
# the module itself was imported, so the problem is with some dependencies
raise e
try:
dal = _get_dal(cfg, name)
warnings.high(
f'''
Specifying modules' dependencies in the config or in my/config/repos is deprecated!
Please install {' '.join(requires)} as PIP packages (see the corresponding README instructions).
'''.strip(),
stacklevel=2,
)
except ModuleNotFoundError:
dal = None
if dal is None:
# probably means there was nothing in the old config in the first place
# so we should raise the original exception
raise e
return dal
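# For illustration, a hedged sketch of how a module might use this fallback
# (the 'rexport' name is just an example for this snippet):
#
#     try:
#         import rexport.dal as dal
#     except ModuleNotFoundError as e:
#         dal = pre_pip_dal_handler('rexport', e, cfg=config, requires=['rexport'])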
def _get_dal(cfg, module_name: str):
mpath = getattr(cfg, module_name, None)
if mpath is not None:
from .utils.imports import import_dir
return import_dir(mpath, '.dal')
else:
from importlib import import_module
return import_module(f'my.config.repos.{module_name}.dal')
V = TypeVar('V')
# named to be kinda consistent with more_itertools, e.g. more_itertools.always_iterable
class always_supports_sequence(Iterator[V]):
"""
Helper to make the migration from Sequence/List to Iterable/Iterator types backwards compatible at runtime
"""
def __init__(self, it: Iterator[V]) -> None:
self._it = it
self._list: list[V] | None = None
self._lit: Iterator[V] | None = None
def __iter__(self) -> Iterator[V]: # noqa: PYI034
if self._list is not None:
self._lit = iter(self._list)
return self
def __next__(self) -> V:
if self._list is not None:
assert self._lit is not None
delegate = self._lit
else:
delegate = self._it
return next(delegate)
def __getattr__(self, name):
return getattr(self._it, name)
@property
def _aslist(self) -> list[V]:
if self._list is None:
qualname = getattr(self._it, '__qualname__', '<no qualname>') # defensive just in case
warnings.medium(f'Using {qualname} as list is deprecated. Migrate to iterative processing or call list() explicitly.')
self._list = list(self._it)
# this is necessary for the list constructor to work correctly
# since it calls __iter__ first, then tries to compute the length, and only then starts iterating...
self._lit = iter(self._list)
return self._list
def __len__(self) -> int:
return len(self._aslist)
def __getitem__(self, i: int) -> V:
return self._aslist[i]
def test_always_supports_sequence_list_constructor() -> None:
exhausted = 0
def it() -> Iterator[str]:
nonlocal exhausted
yield from ['a', 'b', 'c']
exhausted += 1
sit = always_supports_sequence(it())
# list constructor is a bit special... it's trying to compute length if it's available to optimize memory allocation
# so, what's happening in this case is
# - sit.__iter__ is called
# - sit.__len__ is called
# - sit.__next__ is called
res = list(sit)
assert res == ['a', 'b', 'c']
assert exhausted == 1
res = list(sit)
assert res == ['a', 'b', 'c']
assert exhausted == 1 # this will iterate over 'cached' list now, so original generator is only exhausted once
def test_always_supports_sequence_indexing() -> None:
exhausted = 0
def it() -> Iterator[str]:
nonlocal exhausted
yield from ['a', 'b', 'c']
exhausted += 1
sit = always_supports_sequence(it())
assert len(sit) == 3
assert exhausted == 1
assert sit[2] == 'c'
assert sit[1] == 'b'
assert sit[0] == 'a'
assert exhausted == 1
# a few tests to make sure list-like operations are working..
assert list(sit) == ['a', 'b', 'c']
assert [x for x in sit] == ['a', 'b', 'c'] # noqa: C416
assert list(sit) == ['a', 'b', 'c']
assert [x for x in sit] == ['a', 'b', 'c'] # noqa: C416
assert exhausted == 1
def test_always_supports_sequence_next() -> None:
exhausted = 0
def it() -> Iterator[str]:
nonlocal exhausted
yield from ['a', 'b', 'c']
exhausted += 1
sit = always_supports_sequence(it())
x = next(sit)
assert x == 'a'
assert exhausted == 0
x = next(sit)
assert x == 'b'
assert exhausted == 0
def test_always_supports_sequence_iter() -> None:
exhausted = 0
def it() -> Iterator[str]:
nonlocal exhausted
yield from ['a', 'b', 'c']
exhausted += 1
sit = always_supports_sequence(it())
for x in sit:
assert x == 'a'
break
x = next(sit)
assert x == 'b'
assert exhausted == 0
x = next(sit)
assert x == 'c'
assert exhausted == 0
for _ in sit:
raise RuntimeError # shouldn't trigger, just exhaust the iterator
assert exhausted == 1

my/core/influxdb.py Normal file

@@ -0,0 +1,159 @@
'''
TODO doesn't really belong to 'core' morally, but can think of moving out later
'''
from __future__ import annotations
from .internal import assert_subpackage
assert_subpackage(__name__)
from collections.abc import Iterable
from typing import Any
import click
from .logging import make_logger
from .types import Json, asdict
logger = make_logger(__name__)
class config:
db = 'db'
RESET_DEFAULT = False
def fill(it: Iterable[Any], *, measurement: str, reset: bool = RESET_DEFAULT, dt_col: str = 'dt') -> None:
# todo infer dt column automatically, reuse in stat?
# it doesn't like dots, ends up with some syntax error?
measurement = measurement.replace('.', '_')
# todo autoinfer measurement?
db = config.db
from influxdb import InfluxDBClient # type: ignore
client = InfluxDBClient()
# todo maybe create if not exists?
# client.create_database(db)
# todo should it be an env variable?
if reset:
logger.warning('deleting measurements: %s:%s', db, measurement)
client.delete_series(database=db, measurement=measurement)
# TODO need to take schema here...
cache: dict[str, bool] = {}
def good(f, v) -> bool:
c = cache.get(f)
if c is not None:
return c
t = type(v)
r = t in {str, int}
cache[f] = r
if not r:
logger.warning('%s: filtering out %s=%s because of type %s', measurement, f, v, t)
return r
def filter_dict(d: Json) -> Json:
return {f: v for f, v in d.items() if good(f, v)}
def dit() -> Iterable[Json]:
for i in it:
d = asdict(i)
tags: Json | None = None
tags_ = d.get('tags') # meh... handle in a more robust manner
if tags_ is not None and isinstance(tags_, dict): # FIXME meh.
del d['tags']
tags = tags_
# TODO what to do with exceptions??
# todo handle errors.. not sure how? maybe add tag for 'error' and fill with empty data?
dt = d[dt_col].isoformat()
del d[dt_col]
fields = filter_dict(d)
yield {
'measurement': measurement,
# TODO maybe good idea to tag with database file/name? to inspect inconsistencies etc..
# hmm, so tags are autoindexed and might be faster?
# not sure what's the big difference though
# "fields are data and tags are metadata"
'tags': tags,
'time': dt,
'fields': fields,
}
from more_itertools import chunked
# "The optimal batch size is 5000 lines of line protocol."
# some chunking is def necessary, otherwise it fails
inserted = 0
for chi in chunked(dit(), n=5000):
chl = list(chi)
inserted += len(chl)
logger.debug('writing next chunk %s', chl[-1])
client.write_points(chl, database=db)
logger.info('inserted %d points', inserted)
# todo "Specify timestamp precision when writing to InfluxDB."?
def magic_fill(it, *, name: str | None = None, reset: bool = RESET_DEFAULT) -> None:
if name is None:
assert callable(it) # generators have no name/module
name = f'{it.__module__}:{it.__name__}'
assert name is not None
if callable(it):
it = it()
from itertools import tee
from more_itertools import first, one
it, x = tee(it)
f = first(x, default=None)
if f is None:
logger.warning('%s has no data', name)
return
# TODO can we reuse pandas code or something?
#
from .pandas import _as_columns
schema = _as_columns(type(f))
from datetime import datetime
dtex = RuntimeError(f'expected single datetime field. schema: {schema}')
dtf = one((f for f, t in schema.items() if t == datetime), too_short=dtex, too_long=dtex)
fill(it, measurement=name, reset=reset, dt_col=dtf)
@click.group()
def main() -> None:
pass
@main.command(name='populate', short_help='populate influxdb')
@click.option('--reset', is_flag=True, help='Reset Influx measurements before inserting', show_default=True)
@click.argument('FUNCTION_NAME', type=str, required=True)
def populate(*, function_name: str, reset: bool) -> None:
from .__main__ import _locate_functions_or_prompt
[provider] = list(_locate_functions_or_prompt([function_name]))
# todo could have a non-interactive version which populates from all data sources for the provider?
magic_fill(provider, reset=reset)
# todo later just add to hpi main?
# not sure if want to couple
if __name__ == '__main__':
main()
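For reference, a couple of hedged usage sketches; the my.emfit module and its datas() function are assumptions for this snippet:
# programmatic: push a datasource into influxdb under a named measurement
import my.core.influxdb as influxdb
import my.emfit  # hypothetical HPI module with a datetime ('dt') field in its items

influxdb.fill(my.emfit.datas(), measurement='my.emfit')

# or via the CLI defined above, letting magic_fill infer the datetime column:
#   python3 -m my.core.influxdb populate --reset 'my.emfit.datas'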

my/core/init.py Normal file

@@ -0,0 +1,73 @@
'''
A hook to insert user's config directory into Python's search path.
Note that this file is imported only if we don't have custom user config (under my.config namespace) in PYTHONPATH
Ideally that would be in __init__.py (so it's executed without having to import explicitly)
But, with namespace packages, we can't have __init__.py in the parent subpackage
(see http://python-notes.curiousefficiency.org/en/latest/python_concepts/import_traps.html#the-init-py-trap)
Instead, this is imported in the stub config (in this repository), so if the stub config is used, it triggers import of the 'real' config.
Please let me know if you are aware of a better way of dealing with this!
'''
# separate function to prevent namespace pollution
def setup_config() -> None:
import sys
import warnings
from pathlib import Path
from .preinit import get_mycfg_dir
mycfg_dir = get_mycfg_dir()
if not mycfg_dir.exists():
warnings.warn(f"""
'my.config' package isn't found! (expected at '{mycfg_dir}'). This is likely to result in issues.
See https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#setting-up-the-modules for more info.
""".strip(), stacklevel=1)
return
mpath = str(mycfg_dir)
# NOTE: we _really_ want to have mpath in front there, to shadow the my.config stub within this package
# hopefully it doesn't cause any issues
sys.path.insert(0, mpath)
# remove the stub and reimport the 'real' config
# likely my.config will always be in sys.modules, but defensive just in case
if 'my.config' in sys.modules:
del sys.modules['my.config']
# this should import from mpath now
try:
import my.config
except ImportError as ex:
# just in case... who knows what crazy setup users have
import logging
logging.exception(ex)
warnings.warn(f"""
Importing 'my.config' failed! (error: {ex}). This is likely to result in issues.
See https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#setting-up-the-modules for more info.
""", stacklevel=1)
else:
# defensive just in case -- __file__ may not be present if there is some dynamic magic involved
used_config_file = getattr(my.config, '__file__', None)
if used_config_file is not None:
used_config_path = Path(used_config_file)
try:
# will crash if it's imported from another dir?
used_config_path.relative_to(mycfg_dir)
except ValueError:
# TODO maybe implement a strict mode where these warnings will be errors?
warnings.warn(
f"""
Expected my.config to be located at {mycfg_dir}, but instead its path is {used_config_path}.
This will likely cause issues down the line -- double check {mycfg_dir} structure.
See https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#setting-up-the-modules for more info.
""", stacklevel=1
)
setup_config()
del setup_config
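For reference, a hedged sketch of the trigger described in the docstring; the exact stub layout is an assumption:
# hypothetical tail of the stub my/config.py shipped with this repository
from my.core import init  # noqa: F401  # importing it runs setup_config() as a side effect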

my/core/internal.py Normal file

@@ -0,0 +1,9 @@
"""
Utils specific to hpi core, shouldn't really be used by HPI modules
"""
def assert_subpackage(name: str) -> None:
# can lead to some unexpected issues if you 'import cachew' while inside the my/core directory.. so let's protect against it
# NOTE: if we use overlay, name can be smth like my.origg.my.core.cachew ...
assert name == '__main__' or 'my.core' in name, f'Expected module __name__ ({name}) to be __main__ or start with my.core'
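A minimal sketch of the intended use at the top of a my.core submodule (mirroring kompress.py just below):
from my.core.internal import assert_subpackage

assert_subpackage(__name__)  # raises AssertionError if imported from outside my.core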

my/core/kompress.py Normal file

@@ -0,0 +1,17 @@
from .internal import assert_subpackage
assert_subpackage(__name__)
from . import warnings
# do this later -- for now need to transition modules to avoid using kompress directly (e.g. ZipPath)
# warnings.high('my.core.kompress is deprecated, please use "kompress" library directly. See https://github.com/karlicoss/kompress')
try:
from kompress import *
except ModuleNotFoundError as e:
if e.name == 'kompress':
warnings.high('Please install kompress (pip3 install kompress). Falling back onto the vendorized kompress for now.')
from ._deprecated.kompress import * # type: ignore[assignment]
else:
raise e

my/core/konsume.py Normal file

@@ -0,0 +1,254 @@
'''
Some experimental JSON parsing, basically to ensure that all data is consumed.
This can potentially allow both for safer defensive parsing, and for letting you know if the data source started returning more data
TODO perhaps need to get some inspiration from linear logic to decide on a nice API...
'''
from __future__ import annotations
from collections import OrderedDict
from typing import Any
def ignore(w, *keys):
for k in keys:
w[k].ignore()
def zoom(w, *keys):
return [w[k].zoom() for k in keys]
# TODO need to support lists
class Zoomable:
def __init__(self, parent, *args, **kwargs) -> None:
super().__init__(*args, **kwargs)
self.parent = parent
# TODO not sure, maybe do it via del??
# TODO need to make sure they are in proper order? object should be last..
@property
def dependants(self):
raise NotImplementedError
def ignore(self) -> None:
self.consume_all()
def consume_all(self) -> None:
for d in self.dependants:
d.consume_all()
self.consume()
def consume(self) -> None:
assert self.parent is not None
self.parent._remove(self)
def zoom(self) -> Zoomable:
self.consume()
return self
def _remove(self, xx):
raise NotImplementedError
def this_consumed(self):
raise NotImplementedError
class Wdict(Zoomable, OrderedDict):
def _remove(self, xx):
keys = [k for k, v in self.items() if v is xx]
assert len(keys) == 1
del self[keys[0]]
@property
def dependants(self):
return list(self.values())
def this_consumed(self):
return len(self) == 0
# TODO specify mypy type for the index special method?
class Wlist(Zoomable, list):
def _remove(self, xx):
self.remove(xx)
@property
def dependants(self):
return list(self)
def this_consumed(self):
return len(self) == 0
class Wvalue(Zoomable):
def __init__(self, parent, value: Any) -> None:
super().__init__(parent)
self.value = value
@property
def dependants(self):
return []
def this_consumed(self):
return True # TODO not sure..
def __repr__(self):
return 'WValue{' + repr(self.value) + '}'
def _wrap(j, parent=None) -> tuple[Zoomable, list[Zoomable]]:
res: Zoomable
cc: list[Zoomable]
if isinstance(j, dict):
res = Wdict(parent)
cc = [res]
for k, v in j.items():
vv, c = _wrap(v, parent=res)
res[k] = vv
cc.extend(c)
return res, cc
elif isinstance(j, list):
res = Wlist(parent)
cc = [res]
for i in j:
ii, c = _wrap(i, parent=res)
res.append(ii)
cc.extend(c)
return res, cc
elif isinstance(j, (int, float, str, type(None))):
res = Wvalue(parent, j)
return res, [res]
else:
raise RuntimeError(f'Unexpected type: {type(j)} {j}')
from collections.abc import Iterator
from contextlib import contextmanager
class UnconsumedError(Exception):
pass
# TODO think about error policy later...
@contextmanager
def wrap(j, *, throw=True) -> Iterator[Zoomable]:
w, children = _wrap(j)
yield w
for c in children:
if not c.this_consumed(): # TODO hmm. how does it figure out if it's consumed???
if throw:
# TODO need to keep a full path or something...
raise UnconsumedError(f'''
Expected {c} to be fully consumed by the parser.
'''.lstrip())
else:
# TODO log?
pass
from typing import cast
def test_unconsumed() -> None:
import pytest
with pytest.raises(UnconsumedError):
with wrap({'a': 1234}) as w:
w = cast(Wdict, w)
pass
with pytest.raises(UnconsumedError):
with wrap({'c': {'d': 2222}}) as w:
w = cast(Wdict, w)
d = w['c']['d'].zoom()
def test_consumed() -> None:
with wrap({'a': 1234}) as w:
w = cast(Wdict, w)
a = w['a'].zoom()
with wrap({'c': {'d': 2222}}) as w:
w = cast(Wdict, w)
c = w['c'].zoom()
d = c['d'].zoom()
def test_types() -> None:
# (string, number, object, array, boolean or null)
with wrap({'string': 'string', 'number': 3.14, 'boolean': True, 'null': None, 'list': [1, 2, 3]}) as w:
w = cast(Wdict, w)
w['string'].zoom()
w['number'].consume()
w['boolean'].zoom()
w['null'].zoom()
for x in list(w['list'].zoom()): # TODO eh. how to avoid the extra list thing?
x.consume()
def test_consume_all() -> None:
with wrap({'aaa': {'bbb': {'hi': 123}}}) as w:
w = cast(Wdict, w)
aaa = w['aaa'].zoom()
aaa['bbb'].consume_all()
def test_consume_few() -> None:
import pytest
pytest.skip('Will think about it later..')
with wrap({'important': 123, 'unimportant': 'whatever'}) as w:
w = cast(Wdict, w)
w['important'].zoom()
w.consume_all()
# TODO hmm, we want smth like this to work..
def test_zoom() -> None:
import pytest
with wrap({'aaa': 'whatever'}) as w:
w = cast(Wdict, w)
with pytest.raises(KeyError):
w['nosuchkey'].zoom()
w['aaa'].zoom()
# TODO type check this...
# TODO feels like the whole thing kind of unnecessarily complex
# - cons:
# - in most cases this is not even needed? who cares if we miss a few attributes?
# - pro: on the other hand it could be interesting to know about new attributes in data,
# and without this kind of processing we wouldn't even know
# alternatives
# - manually process data
# e.g. use asserts, dict.pop and dict.values() methods to unpack things
# - pros:
# - very simple, since uses built in syntax
# - very performant, as fast as it gets
# - very flexible, easy to adjust behaviour
# - cons:
# - can forget to assert about extra entities etc, so error prone
# - if we do something like =assert j.pop('status') == 200, j=, by the time the assert happens we have already popped the item -- makes error handling harder
# - a bit verbose.. so probably requires some helper functions though (could be much leaner than current konsume though)
# - if we assert, then terminates parsing too early, if we're defensive then inflates the code a lot with if statements
# - TODO perhaps combine warnings somehow or at least only emit once per module?
# - hmm actually tbh if we carefully go through everything and don't make copies, then only requires one assert at the very end?
# - TODO this is kinda useful? https://discuss.python.org/t/syntax-for-dictionnary-unpacking-to-variables/18718
# operator.itemgetter?
# - TODO can use match operator in python for this? quite nice actually! and allows for dynamic behaviour
# only from 3.10 tho, and gonna be tricky to do dynamic defensive behaviour with this
# - TODO in a sense, blenser already would hint if some meaningful fields aren't being processed? only if they are changing though
# - define a "schema" for data, then just recursively match data against the schema?
# possibly pydantic already does something like that? not sure about performance though
# pros:
# - much simpler to extend and understand what's going on
# cons:
# - more rigid, so it becomes tricky to do dynamic stuff (e.g. if schema actually changes)
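For contrast, a hedged sketch of the manual pop-and-assert alternative discussed above (the payload shape is made up):
def parse_manual(j: dict) -> tuple[str, int]:
    j = dict(j)  # work on a copy, so the asserts below can still show the full payload
    name = j.pop('name')
    status = j.pop('status')
    assert status == 200, (name, status)
    # single final assert: anything left over is an 'unconsumed' field, same guarantee wrap() gives
    assert not j, f'unexpected extra fields: {j}'
    return name, status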

Some files were not shown because too many files have changed in this diff.