Compare commits


226 commits

Author SHA1 Message Date
Dima Gerasimov
bb703c8c6a twitter.android: fix get_own_user_id for latest exports 2024-12-29 15:48:15 +00:00
Dima Gerasimov
54df429f61 core.sqlite: add helper SqliteTool to get table schemas 2024-12-29 15:16:03 +00:00
purarue
f1d23c5e96 smscalls: allow large XML files as input
once XML files increase past a certain size
(was about 220MB for me), the parser just throws
an error because the tree is too large (iirc for
security reasons)

could maybe look at using iterparse in the future
to parse it without loading the whole file, but this
seems to fix it fine for me
2024-12-28 21:46:28 +00:00
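
For context, a minimal sketch of the iterparse approach mentioned above: it streams elements instead of building the whole tree. The `sms` tag name and the use of lxml's `huge_tree` flag are assumptions for illustration, not the actual fix:

```python
from pathlib import Path
from typing import Iterator

from lxml import etree


def parse_messages(path: Path) -> Iterator[dict]:
    # stream <sms> elements one at a time instead of loading the whole tree;
    # huge_tree=True lifts libxml2's safety limits that very large exports trip over
    for _event, elem in etree.iterparse(str(path), tag='sms', huge_tree=True):
        yield dict(elem.attrib)
        elem.clear()  # free already-processed elements to keep memory flat
```
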
purarue
d8c53bde34 smscalls: add phone number to model 2024-11-26 21:53:52 +00:00
purarue
95a16b956f
doc: some performance notes for query_range (#409)
* doc: some performance notes for query_range
* add ruff_cache to gitignore
2024-11-26 21:53:10 +00:00
purarue
a7f05c2cad doc: spelling fixes 2024-11-26 21:51:40 +00:00
Srajan Garg
ad55c5c345
fix typo in rexport DAL (#405)
* fix typo in rexport DAL
2024-11-13 00:05:27 +00:00
purarue
7ab6f0d5cb chore: update urls 2024-10-30 20:12:00 +00:00
Dima Gerasimov
a2b397ec4a my.whatsapp.android: adapt to new db format 2024-10-22 21:35:52 +01:00
Dima Gerasimov
8496d131e7 general: migrate modules to use 3.9 features 2024-10-19 23:41:22 +01:00
karlicoss
d3f9a8e8b6
core: migrate code to benefit from 3.9 stuff (#401)
for now keeping ruff on 3.8 target version, need to sort out modules as well
2024-10-19 20:55:09 +01:00
Dima Gerasimov
bc7c3ac253 general: python3.8 reached EOL, switch min version to 3.9
also enable 3.13 on CI
2024-10-19 18:58:17 +01:00
Dima Gerasimov
a8f86e32b9 core.time: hotfix for default force_abbreviations attribute 2024-09-23 22:04:41 +01:00
Dima Gerasimov
6a6d157040 cli: fix minor race condition in creating hpi_temp_dir 2024-09-23 01:22:16 +01:00
Dima Gerasimov
bf8af6c598 tox: try using uv for CI, should result in speedup
see https://github.com/karlicoss/HPI/issues/391
2024-09-23 01:22:16 +01:00
Dima Gerasimov
8ed9e1947e my.youtube.takeout: deduplicate watched videos and sort out a few minor errors 2024-09-22 23:46:41 +01:00
Dima Gerasimov
75639a3d5e tox: some prep for potentially using uv on CI instead of pip
see https://github.com/karlicoss/HPI/issues/391
2024-09-22 20:10:52 +01:00
Dima Gerasimov
3166109f15 my.core: fix list constructor in always_support_sequence and add some tests 2024-09-22 04:35:30 +01:00
Dima Gerasimov
02dabe9f2b my.twitter.archive: cleanup linting and use proper configuration via abstract class 2024-09-22 02:13:10 +01:00
Dima Gerasimov
239e6617fe my.twitter.archive: deduplicate tweets based on id_str/created_at and raw tweet text 2024-09-22 02:13:10 +01:00
Dima Gerasimov
e036cc9e85 my.twitter.android: get own user id as string, consistent with rest of module 2024-09-22 02:13:10 +01:00
Dima Gerasimov
2ca323da84 my.fbmessenger.android: exclude unsent messages to avoid duplication 2024-09-21 23:25:25 +01:00
Dima Gerasimov
6a18f47c37 my.github.gdpr/my.zulip.organization: use kompress support for tar.gz if it's available
otherwise fall back onto unpacking into tmp dir via my.core.structure
2024-09-18 23:35:03 +01:00
Dima Gerasimov
201ddd4d7c my.core.structure: add support for .tar.gz archives
this will be useful to migrate .tar.gz processing to kompress in a backwards compatible way, or to run them against unpacked folder structure if user prefers
2024-09-17 00:25:17 +01:00
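
A rough sketch of the tmp-dir fallback described above, i.e. unpacking a .tar.gz into a temporary directory and working against the unpacked folder structure (the helper name is hypothetical, not the actual my.core.structure code):

```python
import tarfile
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def unpacked_tar_gz(archive: Path) -> Iterator[Path]:
    # extract the archive into a throwaway directory and yield its root,
    # cleaning everything up once the caller is done with it
    with tempfile.TemporaryDirectory() as tmp:
        with tarfile.open(archive, mode='r:gz') as tf:
            tf.extractall(tmp)
        yield Path(tmp)
```
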
Dima Gerasimov
27178c0939 my.google.takeout.parser: speedup event merging on newer google_takeout_parser versions 2024-09-13 02:31:12 +01:00
Dima Gerasimov
71fdeca5e1 ci: update mypy config and make ruff config more consistent with other projects 2024-08-31 02:17:49 +01:00
Dima Gerasimov
d58453410c ruff: process remaining existing checks and suppress the annoying ones 2024-08-28 04:06:32 +01:00
Dima Gerasimov
1c5efc46aa ruff: enable TRY rules 2024-08-28 04:06:32 +01:00
Dima Gerasimov
affa79ba3a my.time.tz.via_location: fix accidental RuntimeError introduced in previous MR 2024-08-28 04:06:32 +01:00
Dima Gerasimov
fc0e0be291 ruff: enable ICN and PD rules 2024-08-28 04:06:32 +01:00
Dima Gerasimov
c5df3ce128 ruff: enable W, COM, EXE rules 2024-08-28 04:06:32 +01:00
Dima Gerasimov
ac08af7aab ruff: enable PT (pytest) rules 2024-08-28 04:06:32 +01:00
Dima Gerasimov
9fd4227abf ruff: enable RET/PIE/PLW 2024-08-28 04:06:32 +01:00
Dima Gerasimov
bd1e5d2f11 ruff: enable PERF checks set 2024-08-28 04:06:32 +01:00
Dima Gerasimov
985c0f94e6 ruff: attempt to enable ARG checks, suppress in some places 2024-08-28 04:06:32 +01:00
Dima Gerasimov
72cc8ff3ac ruff: enable B warnings (mainly suppressed exceptions and unused variables) 2024-08-28 04:06:32 +01:00
Dima Gerasimov
d0df8e8f2d ruff: enable PLR rules and fix bug in my.github.gdpr._is_bot 2024-08-28 04:06:32 +01:00
Dima Gerasimov
b594377a59 ruff: enable RUF ruleset 2024-08-28 04:06:32 +01:00
Dima Gerasimov
664c40e3e8 ruff: enable FBT rules to detect boolean arguments use without kwargs 2024-08-28 04:06:32 +01:00
Dima Gerasimov
118c2d4484 ruff: enable UP ruleset for detecting python deprecations 2024-08-28 04:06:32 +01:00
Dima Gerasimov
d244c7cc4e ruff: enable and fix C4 ruleset 2024-08-28 04:06:32 +01:00
Dima Gerasimov
c08ddbc781 general: small updates for typing while trying out pyright 2024-08-28 04:06:32 +01:00
Dima Gerasimov
b1fe23b8d0 my.rss.feedly/my.twitter.talon -- migrate to use lazy user configs 2024-08-26 04:00:58 +01:00
Dima Gerasimov
b87d1c970a tests: move remaining tests from tests/ to my.tests, cleanup corresponding modules 2024-08-26 04:00:58 +01:00
Dima Gerasimov
a5643206a0 general: make time.tz.via_location user config lazy, move tests to my.tests package
also gets rid of the problematic reset_modules thingie
2024-08-26 04:00:58 +01:00
Dima Gerasimov
270080bd56 core.error: better defensive handling for my.core.source when parts of config are missing 2024-08-26 04:00:58 +01:00
Dima Gerasimov
094519acaf tests: disable cachew in my.tests subpackage 2024-08-26 04:00:58 +01:00
Dima Gerasimov
7cae9d5bf3 my.google.takeout.paths: migrate to new style lazy config
also clean up tests a little and move into my.tests.location.google
2024-08-26 04:00:58 +01:00
Dima Gerasimov
2ff2dcfc00 tests: move test checking for my_config handling to core/tests/test_config.py
allows removing the hacky reset_modules thing from setup fixture
2024-08-25 20:49:56 +01:00
Dima Gerasimov
1215181af5 core: move stuff from tests/demo.py to my/core/tests/test_config.py
also clean all this up a bit
2024-08-25 20:49:56 +01:00
Dima Gerasimov
5a67f0bafe pdfs: migrate config to Protocol with properties
allows removing a whole bunch of hacky crap from tests!
2024-08-25 20:49:56 +01:00
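
To illustrate the Protocol-with-properties approach (attribute names here are hypothetical, not the real pdfs config):

```python
from pathlib import Path
from typing import Protocol, Sequence


class config(Protocol):
    # required: where the PDF files live
    @property
    def paths(self) -> Sequence[Path]: ...

    # optional: module-side default that a user's config class can override
    @property
    def enable_ocr(self) -> bool:
        return False
```

A user config then only needs to provide matching attributes; the properties keep defaults on the module side and make the whole thing checkable by mypy without test-time patching hacks.
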
Dima Gerasimov
d154825591 my.bluemaestro: make config construction lazy
following the discussions here: https://github.com/karlicoss/HPI/issues/46#issuecomment-2295464073
2024-08-25 20:49:56 +01:00
Dima Gerasimov
9f017fb29b my.core.pandas: add more tests 2024-08-20 00:15:15 +01:00
karlicoss
5ec357915b core.common: add test for classproperty 2024-08-17 13:05:56 +01:00
karlicoss
245ad22057 core.common: bring back asdict backwards compat -- was used in orger 2024-08-17 13:05:56 +01:00
Dima Gerasimov
7bfce72b7c core: cleanup/sort imports according to ruff check --select I 2024-08-16 11:38:13 +01:00
Dima Gerasimov
7023088d13 core.common: deprecate outdated LazyLogger alias 2024-08-16 10:22:29 +01:00
Dima Gerasimov
614c929f95 core.common: move Json, datetime_aware, datetime_naive, is_namedtuple, asdict to my.core.types 2024-08-16 10:22:29 +01:00
Dima Gerasimov
2b0f92c883 my.core: deprecate Path/dataclass imports from my.core during type checking
runtime still works for backwards compatibility
2024-08-16 10:22:29 +01:00
Dima Gerasimov
7f8a502310 core.common: move assert_subpackage to my.core.internal 2024-08-16 10:22:29 +01:00
Dima Gerasimov
88f3c17c27 core.common: move mime-related stuff to my.core.mime
no backward compat, unlikely it was used by anyone else
2024-08-16 10:22:29 +01:00
Dima Gerasimov
c45c51af22 core.common: move stats-related stuff to my.core.stats and add more thorough tests/docs
deprecate core.common.stat and core.common.Stats with backwards compatibility
2024-08-16 10:22:29 +01:00
Dima Gerasimov
18529257e7 core.common: move DummyExecutor to core.common.utils.concurrent
without backwards compat, unlikely it's been used by anyone
2024-08-16 10:22:29 +01:00
Dima Gerasimov
bcc4c15304 core: cleanup my.core.common.unique_everseen
- move to my.core.utils.itertools
- more robust check for hashable types -- now checks in runtime (since the one based on types purely isn't necessarily sound)
- add more testing
2024-08-16 10:22:29 +01:00
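
A simplified sketch of the runtime hashability check described above (the real helper lives in my.core.utils.itertools and differs in detail):

```python
from typing import Iterable, Iterator, TypeVar

import more_itertools

T = TypeVar('T')


def unique_everseen(iterable: Iterable[T]) -> Iterator[T]:
    def checked(it: Iterable[T]) -> Iterator[T]:
        for item in it:
            # actually hashing the item catches cases a purely type-based check misses,
            # e.g. tuples containing lists; hash() raises TypeError for those,
            # failing fast instead of silently degrading to quadratic behaviour
            hash(item)
            yield item

    return more_itertools.unique_everseen(checked(iterable))
```
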
Dima Gerasimov
06084a8787 my.core.common: move warn_if_empty to my.core.utils.itertools, cleanup and add more tests 2024-08-16 10:22:29 +01:00
Dima Gerasimov
770dba5506 core.common: move away import related stuff to my.core.utils.imports
moving without backward compatibility, since it's extremely unlikely they are used for any external modules

in fact, unclear if these methods still have much value at all, but keeping for now just in case
2024-08-16 10:22:29 +01:00
Dima Gerasimov
66c08a6c80 core.common: move listify to core.utils.itertools, use better typing annotations for it
also some minor refactoring of my.rss
2024-08-16 10:22:29 +01:00
Dima Gerasimov
c64d7f5b67 core: cleanup itertool style helpers
- deprecate group_by_key, should use more_itertools.bucket instead
- move make_dict and ensure_unique to my.core.utils.itertools
2024-08-16 10:22:29 +01:00
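
For reference, the bucket-based replacement for group_by_key could look roughly like this (sample data made up):

```python
from more_itertools import bucket

items = ['apple', 'avocado', 'banana', 'cherry']

# lazily group items by their first letter; bucket() caches only what it needs
grouped = bucket(items, key=lambda s: s[0])
assert sorted(grouped) == ['a', 'b', 'c']        # iterating the bucket yields the keys
assert list(grouped['a']) == ['apple', 'avocado']
```
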
Dima Gerasimov
973c4205df core: cleanup deprecations, exclude from type checking and show runtime warnings
among affected things:

- core.common.assert_never
- core.common.cproperty
- core.common.isoparse
- core.common.mcachew
- core.common.the
- core.common.tzdatetime
- core.compat.sqlite_backup
2024-08-16 10:22:29 +01:00
Dima Gerasimov
a7439c7846 general: move assert_never to my.core.compat as it's in stdlib from 3.11
rely on typing-extensions for fallback

introducing typing-extensions dependency without fallback, should be ok since it's in the top 10 of popular packages
2024-08-16 10:22:29 +01:00
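
The version-gated fallback pattern referred to above looks approximately like this:

```python
import sys

if sys.version_info[:2] >= (3, 11):
    from typing import assert_never
else:
    from typing_extensions import assert_never
```
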
Dima Gerasimov
1317914bff general: add 'destructive parsing' (kinda what we were doing in my.core.konsume) to my.experimental
also some cleanup for my.codeforces and my.topcoder
2024-08-12 13:24:28 +01:00
Dima Gerasimov
1e1e8d8494 my.topcoder: get rid of kjson in favor of using builtin dict methods 2024-08-12 13:24:28 +01:00
Dima Gerasimov
069264ce52 core.common: get rid of deprecated utcfromtimestamp 2024-08-10 17:46:30 +01:00
Dima Gerasimov
c69a0b43ba my.vk.favorites: some minor cleanup 2024-08-10 17:46:30 +01:00
Dima Gerasimov
34593c032d tests: move more tests into core, more consistent tests running in tox 2024-08-07 01:08:39 +01:00
Dima Gerasimov
074e24c309 general: deprecate my.core.dataset and simplify tox file 2024-08-07 01:08:39 +01:00
Dima Gerasimov
fb8e9909a4 tests: simplify tests for my.core.serialize a bit and simplify tox file 2024-08-07 01:08:39 +01:00
Dima Gerasimov
3aebc573e8 tests: use updated conftest from pymplate, this allows running individual test modules properly
e.g. pytest --pyargs my.core.tests.test_get_files
2024-08-06 20:55:16 +01:00
Dima Gerasimov
b615ba10b1 ci: temporary suppress pandas mypy error in check_dateish 2024-08-05 23:35:24 +01:00
Dima Gerasimov
2c63fe25c0 my.twitter.android: get data from statuses table rather than timeline_view 2024-08-05 23:35:24 +01:00
Dima Gerasimov
652ee9b875 fbmessenger.android: fix minor issue with processing thread participants 2024-08-03 19:01:51 +01:00
Dima Gerasimov
9e72672b4f legacy google takeout: fix timezone localization 2024-08-03 16:50:09 +01:00
karlicoss
d5fccf1874 twitter.android: more comments on timeline types 2024-08-03 16:50:09 +01:00
Dima Gerasimov
0e6dd32afe ci: minor fixes after mypy update 2024-08-03 16:18:32 +01:00
Dima Gerasimov
c9c0e19543 my.instagram.gdpr: fix for new format 2024-08-03 16:18:32 +01:00
seanbreckenridge
35dd5d82a0
smscalls: parse mms from smscalls export (#370)
* initial mms exploration
2024-06-05 22:03:03 +01:00
Dima Gerasimov
8a8a1ebb0e my.tinder.android: better error handling and fix case with empty db 2024-04-03 20:13:40 +01:00
Dima Gerasimov
103ea2096e my.coding.commits: fix for git repo discovery after fdfind v9 2024-03-13 00:46:18 +00:00
Dima Gerasimov
751ed02f43 tests: pin pytest version to <8 for now, having some test collection errors
https://docs.pytest.org/en/stable/changelog.html#collection-changes
2024-03-13 00:46:18 +00:00
Dima Gerasimov
477b7e8fd3 docs: minor update to overlays docs 2024-03-13 00:46:18 +00:00
Dima Gerasimov
0f3d09915c ci: update actions versions 2024-03-13 00:46:18 +00:00
Dima Gerasimov
7236024c7a my.twitter.android: better detection of own user id 2024-03-13 00:46:18 +00:00
Dima Gerasimov
87a8a7781b my.google.maps: initial module for extracting places data from the Android app 2024-01-01 23:46:02 +00:00
Sean Breckenridge
93e475795d google takeout: support multiple locales
uses the known locales in google_takeout_parser
to determine the expected paths for each locale,
and performs a partial match on the paths to
detect and use match_structure
2023-12-31 18:57:30 +00:00
Dima Gerasimov
1b187b2c1b whatsapp.android: expose all entities extracted from the db 2023-12-29 00:57:49 +00:00
Dima Gerasimov
3ec362fce9 fbmessenger.android: expose contacts 2023-12-28 18:13:16 +00:00
karlicoss
a0ce666024 my.youtube.takeout: fix exception handling 2023-12-28 00:25:05 +00:00
karlicoss
1c452b12d4 twitter.android: extract likes and own tweets as well 2023-12-28 00:12:39 +00:00
karlicoss
51209c547e my.twitter.android: refactor into a proper module
for now only extracting bookmarks, will use it for some time and see how it goes
2023-12-24 00:49:07 +00:00
karlicoss
a4a7bc41b9 my.twitter.android: extract entities 2023-12-24 00:49:07 +00:00
karlicoss
3d75abafe9 my.twitter.android: some initial work on parsing sqlite databases from the official Android app 2023-12-24 00:49:07 +00:00
Dima Gerasimov
a8f8858cb1 docs: document more experiments with overlays in docs 2023-12-22 02:54:36 +00:00
Dima Gerasimov
adbc0e73a2 docs: add note about directly checking overlays with mypy 2023-12-22 02:54:36 +00:00
Dima Gerasimov
84d835962d docs: some documentation/thoughts on properly implementing overlay packages 2023-12-20 02:51:27 +00:00
Sean Breckenridge
224ba521e3 gpslogger: catch broken xml file error 2023-12-20 02:41:52 +00:00
Dima Gerasimov
a843407e40 core/compat: move fromisoformat to .core.compat module 2023-11-19 23:45:08 +00:00
karlicoss
09e0f66892 tox: disable --parallel flag in hpi module install
It's been so flaky it ends up taking more time to merge stuff. See https://github.com/karlicoss/HPI/issues/306
2023-11-19 19:18:19 +00:00
Dima Gerasimov
bde43d6a7a my.body.sleep: massive speedup for average temperature calculation 2023-11-11 00:42:49 +00:00
karlicoss
37643c098f tox: remove cat coverage index from tox, it's not very useful anyway 2023-11-10 23:11:54 +00:00
karlicoss
7b1cec9326 codeforces/topcode: move to top level and check in ci 2023-11-10 23:11:54 +00:00
karlicoss
657ce08ac8 fix mypy issues after mypy/libraries updates 2023-11-10 22:59:09 +00:00
karlicoss
996169aa29 time.tz.via_location: more consistent behaviour wrt caching
previously it was possible for cachew to never properly initialize the cache if you only queried some dates in the past,
because we never made it to the end of _iter_tzs

also some minor cleanup
2023-11-10 22:59:09 +00:00
karlicoss
70bb9ed0c5 location.google_takeout_semantic: handle None visitConfidence 2023-11-10 02:10:30 +00:00
karlicoss
65c617ed94 my.emfit: add missing properties to fake data generator 2023-11-10 02:10:30 +00:00
karlicoss
ac5f71c68b my.jawbone: get rid of matplotlib import on top level 2023-11-10 02:10:30 +00:00
karlicoss
e547acfa59 general: update minimal cachew version
had quite a few useful fixes/performance optimizations since
2023-11-07 21:24:56 +00:00
karlicoss
33f8d867e2 my.browser.export: cleanup
- make logging INFO (default) -- otherwise it's too quiet during processing lots of databases
- can pass inputs cachew directly now
2023-11-07 21:24:56 +00:00
karlicoss
19353e996d my.hackernews.harmonic: use orjson + add __hash__ for Saved object
plus some minor cleanup
2023-11-07 01:03:57 +00:00
karlicoss
4ac3bbb101 my.bumble.android: fix message deduplication 2023-11-07 01:03:57 +00:00
karlicoss
5630621ec1 my.pinboard: some cleanup 2023-11-06 23:10:00 +00:00
karlicoss
7631f1f2e4 monzo.monzoexport: initial module 2023-11-02 00:47:13 +00:00
karlicoss
105928238f vk_messages_backup: some cleanup + switch to get_files 2023-11-02 00:43:10 +00:00
Dima Gerasimov
24da04f142 ci: fix wrong release command 2023-11-01 01:54:16 +00:00
karlicoss
71cb66df5f core: add helper for more_itertools to check that all types involved are hashable
Otherwise unique_everseen performance may degrade to quadratic rather than linear

For now hidden behind HPI_CHECK_UNIQUE_EVERSEEN flag

also switch some modules to use it
2023-10-31 01:02:17 +00:00
Dima Gerasimov
d6786084ca general: deprecate some old methods by hiding behind TYPE_CHECKING 2023-10-30 22:51:31 +00:00
karlicoss
79ce8e84ec fbmessenger.android: support processing msys database
seems that threads_db2 stopped updating some time ago, and msys contains all new data now
2023-10-30 02:54:22 +00:00
karlicoss
f28f68b14b general: enhance logging for various modules 2023-10-29 22:32:07 +00:00
karlicoss
ea195e3d17 general: improve logging during file processing in various modules 2023-10-29 01:01:30 +01:00
karlicoss
bd27bd4c24 docs: add documentation on logging during HPI module development 2023-10-29 00:50:22 +01:00
karlicoss
f668208bce my.stackexchange.stexport: small cleanup & stat improvements 2023-10-28 21:33:36 +01:00
Dima Gerasimov
6821fbc2fe core/config: implement a warning if config is imported from a dir other than MY_CONFIG
this should help with identifying setup issues
2023-10-28 20:56:07 +01:00
Dima Gerasimov
edea2c2e75 my.kobo: add highlights method to return Highlight objects iteratively
also minor cleanup
2023-10-28 20:06:54 +01:00
Dima Gerasimov
d88a1b9933 my.hypothesis: expose data as iterators instead of lists
also add an adapter to support migrating in backwards compatible manner
2023-10-28 20:06:54 +01:00
Dima Gerasimov
4f7c9b4a71 core: move split compat/legacy modules into hpi_compat and compat 2023-10-28 20:06:54 +01:00
karlicoss
70bf51a125 core/stats: exclude contextmanagers from guess_stats 2023-10-28 00:08:32 +01:00
karlicoss
fb2b3e07de my.emfit: cleanup and pass cpu pool 2023-10-27 23:52:03 +01:00
Dima Gerasimov
32aa87b3ec doctor: make compileall check a bit more defensive 2023-10-27 02:38:22 +01:00
karlicoss
3a25c9042c my.hackernews.dogsheep: use utc datetime + minor cleanup 2023-10-27 02:38:03 +01:00
karlicoss
bef0423b4f my.zulip.organization: use UTC timestamps, support custom archive names + some cleanup 2023-10-27 02:38:03 +01:00
karlicoss
a0910e798d core.logging: ignore CollapseLogsHandler if we're not attached to a terminal
otherwise fails at os.get_terminal_size
2023-10-25 02:42:52 +01:00
Dima Gerasimov
1f61e853c9 reddit.rexport: experiment with using optional cpu pool (used by all of HPI)
Enabled by the env variable, specifying how many cores to dedicate, e.g.

HPI_CPU_POOL=4 hpi query ...
2023-10-25 02:06:45 +01:00
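
A hedged sketch of how such an env-controlled pool might be wired up (function and variable names are illustrative, not necessarily the actual HPI API):

```python
import os
from concurrent.futures import ProcessPoolExecutor
from typing import Optional

_NUM_WORKERS = int(os.environ.get('HPI_CPU_POOL', '0'))

_pool: Optional[ProcessPoolExecutor] = None


def get_cpu_pool() -> Optional[ProcessPoolExecutor]:
    # returns None unless HPI_CPU_POOL is set to a positive number,
    # so modules only parallelise when the user has explicitly opted in
    global _pool
    if _NUM_WORKERS <= 0:
        return None
    if _pool is None:
        _pool = ProcessPoolExecutor(max_workers=_NUM_WORKERS)
    return _pool
```
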
Dima Gerasimov
a5c04e789a twitter.archive: deduplicate results via json.dumps
this speeds up processing quite a bit, from 40s to 20s for me, plus removes tons of identical outputs

interestingly enough, using the raw object without json.dumps as the key brings unique_everseen to a crawl...
2023-10-24 01:54:30 +01:00
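
Roughly, the deduplication key change amounts to something like this (the tweet attribute name is an assumption):

```python
import json

from more_itertools import unique_everseen


def deduplicated(tweets):
    # a canonical JSON string is cheap to hash and compare;
    # using the raw (unhashable) dict as the key would force unique_everseen
    # onto its slow list-based fallback
    return unique_everseen(tweets, key=lambda t: json.dumps(t.raw, sort_keys=True))
```
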
Dima Gerasimov
0e94e0a9ea whatsapp.android: handle most message types properly 2023-10-24 00:31:34 +01:00
Dima Gerasimov
72ab2603d5 my.whatsapp.android: exclude some dummy messages, minor cleanup 2023-10-24 00:31:34 +01:00
Dima Gerasimov
414b88178f tinder.android: infer user's own name automatically 2023-10-24 00:31:34 +01:00
Dima Gerasimov
f355a55e06 my.instagram.gdpr: process all historic archives + better normalising 2023-10-23 18:42:50 +01:00
Dima Gerasimov
f9a1050ceb my.instagram.android: more defensive error handling 2023-10-23 18:42:50 +01:00
karlicoss
86ea605aec core/stats: enable processing input files, report first and last filename
can be useful for quick investigation/testing setup
2023-10-22 00:47:36 +01:00
karlicoss
c335c0c9d8 core/stats: report datetime of first item in addition to last
quite useful for quickly determining time span of a data source
2023-10-22 00:47:36 +01:00
karlicoss
a60d69fb30 core/stats: get rid of duplicated keys for 'auto stats'
previously:
```
{'iter_data': {'iter_data': {'count': 9, 'last': datetime.datetime(2020, 1, 3, 1, 1, 1)}}}
```

after
```
{'iter_data': {'count': 9, 'last': datetime.datetime(2020, 1, 3, 1, 1, 1)}}
```
2023-10-22 00:47:36 +01:00
karlicoss
c5fe2e9412 core.stats: fix is_data_provider when from __future__ import annotations is used 2023-10-21 23:46:40 +01:00
karlicoss
872053a3c3 my.hackernews.harmonic: fix issue with crashing due to html escaping
also add proper logging
2023-10-21 23:46:40 +01:00
karlicoss
37bb33cdbc experimental: add a hacky helper to import "original/shadowed" modules from within overlays 2023-10-21 22:46:16 +01:00
karlicoss
8c2d1c9463 general: use less explicit kompress boilerplate in modules
now get_files/kompress library can handle it transparently
2023-10-20 21:13:59 +01:00
karlicoss
c63e80ce94 core: more consistent handling of zip archives in get_files + tests 2023-10-20 21:13:59 +01:00
Dima Gerasimov
9ffce1b696 reddit.rexport: add accessors for subreddits, multireddits and profile 2023-10-19 02:26:28 +01:00
Dima Gerasimov
29832a9f75 core: fix test_get_files after updating kompress 2023-10-19 02:26:28 +01:00
Dima Gerasimov
28d2450a21 reddit.rexport: some cleanup, move get_events stuff into personal overlay 2023-10-19 02:26:28 +01:00
karlicoss
fe26efaea8 core/kompress: move vendorized to _deprecated, use kompress library directly 2023-10-12 23:47:05 +01:00
karlicoss
bb478f369d core/logging: no need for super call in Filter 2023-10-12 23:47:05 +01:00
karlicoss
68289c1be3 general: fix ignores after mypy version update 2023-10-12 23:47:05 +01:00
Dima Gerasimov
0512488241 ci: sync configs to pymplate
- add python3.12
- add ruff
2023-10-06 02:24:01 +01:00
Dima Gerasimov
fabcbab751 fix mypy errors after version update 2023-10-02 01:27:49 +01:00
Dima Gerasimov
8cd74a9fc4 ci: attempt to use --parallel flag in tox 2023-10-02 01:27:49 +01:00
Sean Breckenridge
f3507613f0 location: make accuracy default config floats
previously they were ints which could possibly
break caching with cachew
2023-10-01 11:52:41 +01:00
Dima Gerasimov
8addd2d58a new module: Harmonic app for Hackernews 2023-09-25 16:36:21 +01:00
Dima Gerasimov
01480ec8eb core/logging: fix issue with logger setup called multiple times when called with different levels
should resolve https://github.com/karlicoss/HPI/issues/308
2023-09-19 22:39:52 +01:00
Sean Breckenridge
be81466871 browser: fix duplicate logs when fetching loglevel 2023-09-15 01:58:45 +01:00
Sean Breckenridge
2a46341ce2 my.core.logging: compatibility with HPI_LOGS
re-adds a removed check for HPI_LOGS, add some docs

fix the checks for browserexport/takeout logs to
use the computed level from my.core.logging
2023-09-07 02:36:26 +01:00
Sean Breckenridge
ff84d8fc88 core/cli: update vendored completion files
update required click version to 8.1
so we don't regenerate the vendored completions
incorrectly in the future
2023-09-07 00:01:27 +01:00
Dima Gerasimov
c283e542e3 general: fix some issues after mypy update 2023-08-24 23:46:23 +01:00
Dima Gerasimov
642e3b14d5 my.github.gdpr: some minor enhancements
- better error context
- handle some unknown files
- handle user=None in some cases
- cleanup imports
2023-08-24 23:46:23 +01:00
Dima Gerasimov
7ec894807f my.bumble.android: handle more msg types 2023-08-24 23:46:23 +01:00
Sean Breckenridge
fcaa7c1561 core/cli: allow user to bypass PEP 668
when installing dependencies with 'hpi module install',
this now lets a user pass '--break-system-packages' (or '-B'),
which passes the same option down to pip, to allow the user
to bypass PEP 668 and install packages that could possibly
conflict with system packages.
2023-08-10 01:41:43 +01:00
Dima Gerasimov
d6af4dec11 my.instagram.android: minor cleanup + cachew 2023-06-21 20:42:10 +01:00
Dima Gerasimov
88a3aa8d67 my.bluemaestro: minor cleanup 2023-06-21 20:42:10 +01:00
Dima Gerasimov
c25ab51664 core: some tweaks for better colour handling when we're redirecting stdout/stderr 2023-06-21 20:42:10 +01:00
Dima Gerasimov
6f6be5c78e my.hackernews.materialistic: process and merge all db exports + minor cleanup 2023-06-21 20:42:10 +01:00
Dima Gerasimov
dff31455f1 general: switch to make_logger in a few modules, use a bit more consistent logging, rely on default INFO level 2023-06-21 18:42:15 +01:00
Dima Gerasimov
661714f1d9 core/logging: overhaul and many improvements -- mainly to deprecate abandoned logzero
- generally saner/cleaner logger initialization

  In particular now it doesn't override logging level specified by the user code prior to instantiating the logger.

  Also remove the `LazyLogger` hack, doesn't seem like it's necessary when the above is implemented.

- get rid of `logzero` which is archived and abandoned now, use `colorlog` for coloured logging formatter

- allow configuring log level via shell via `LOGGING_LEVEL_module_name=<level>`

  E.g. `LOGGING_LEVEL_rescuexport_dal=WARNING LOGGING_LEVEL_my_rescuetime=debug ./script.py`

- port `AddExceptionTraceback` from HPI/promnesia

- port `CollapseLogsHandler` from HPI/promnesia

  Also allow configuring from the shell, e.g. `LOGGING_COLLAPSE=<level>`

- add support for `enlighten` progress bar, so it can be shared between different projects

  See https://github.com/Rockhopper-Technologies/enlighten#readme

  This allows nice CLI progressbars, e.g. for parallel processing of different files from HPI:

    ghexport.dal[111]  29%|████████████████████████████████████████████████████████████████▏              |  29/100 [00:03<00:07, 10.03 files/s]
    rexport.dal[comments]  17%|████████████████████████████████████▋                                      | 115/682 [00:03<00:14, 39.15 files/s]
    my.instagram.android   0%|▎                                                                           |    3/2631 [00:02<34:50, 1.26 files/s]

  Currently off by default, and hidden behind an env variable (`ENLIGHTEN_ENABLE=true`)
2023-06-21 18:42:15 +01:00
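
As a rough illustration of the per-module level override described above (a sketch, not the actual my.core.logging implementation):

```python
import logging
import os


def make_logger(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    # e.g. LOGGING_LEVEL_my_rescuetime=debug sets the level for the 'my.rescuetime' logger
    env_level = os.environ.get('LOGGING_LEVEL_' + name.replace('.', '_'))
    if env_level is not None:
        logger.setLevel(env_level.upper())
    return logger
```
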
Dima Gerasimov
6aa3d4225e sort out mypy after its update 2023-06-21 03:32:46 +01:00
Dima Gerasimov
ab7135d42f core: experimental import of my._init_hook to configure logging/warnings/env variables 2023-06-21 03:32:46 +01:00
Dima Gerasimov
c12224af74 misc: replace uses of pytz.utc with timezone.utc where it makes sense 2023-06-09 03:31:13 +01:00
Dima Gerasimov
c91534b966 set json files to empty dicts so they are at least valid jsons
(promnesia was stumbling over these, seems like the easiest fix :) )
2023-06-09 03:31:13 +01:00
Dima Gerasimov
5fe21240b4 core: move mcachew into my.core.cachew; use better typing annotations (copied from cachew) 2023-06-08 01:29:49 +01:00
Dima Gerasimov
f8cd31044e general: move reddit tests into my/tests + tweak my.core.cfg to be more reliable 2023-05-26 00:58:23 +01:00
Dima Gerasimov
fcfc423a75 move some tests into the main HPI package 2023-05-26 00:03:24 +01:00
Dima Gerasimov
9594caa1cd general: move most core tests inside my.core.tests package
- distributes tests alongside the package, might be convenient for package users
- removes some weird indirection (e.g. dummy test files importing tests from modules)
- makes the command line for tests cleaner (e.g. no need to remember to manually add files to tox.ini)
- tests automatically covered by mypy (so makes mypy runs cleaner and ultimately better coverage)

The (vague) convention is

- tests/somemodule.py -- testing my.core.somemodule, contains tests directly re
- tests/test_something.py -- testing a specific feature, e.g. test_get_files.py tests the get_files method only
2023-05-25 00:25:13 +01:00
Dima Gerasimov
04d976f937 my/core/pandas tests: fix weird pytest error when constructing dataclass inside a def
can quickly reproduce by running pytest tests/tz.py tests/core/test_pandas.py
possibly will be resolved after fix in pytest?
see https://github.com/pytest-dev/pytest/issues/7856
2023-05-24 22:32:44 +01:00
Dima Gerasimov
a98bc6daca my.core.pandas: rely on typing annotations from types-pandas 2023-05-24 22:32:44 +01:00
Dima Gerasimov
fe88380499 general: switch to using native 3.8 versions for cached_property/Literal/Protocol instead of compat 2023-05-16 01:18:30 +01:00
Dima Gerasimov
c34656e8fb general: update mypy config, seems that lots of type: ignore aren't necessary anymore 2023-05-16 01:18:30 +01:00
Dima Gerasimov
a445d2cbfe general: python3.7 will reach EOL soon, remove its support 2023-05-16 01:18:30 +01:00
seanbreckenridge
7a32302d66
query: add --warn-exceptions, dateparser, docs (#290)
* query: add --warn-exceptions, dateparser, docs

added --warn-exceptions (like --raise-exceptions/--drop-exceptions, but
lets you pass a warn_func if you want to customize how the exceptions are
handled. By default this creates a logger in main and logs the exception

added dateparser as a fallback if its installed (it's not a strong dependency, but
I mentioned in the docs that it's useful for parsing dates/times)

added docs for query, and a few examples

--output gpx respects the --{drop,warn,raise}--exceptions flags, have
an example of that in the docs as well
2023-04-18 00:15:35 +01:00
Sean Breckenridge
82bc51d9fc smscalls: make checking for keys stricter
sort of reverts #287, but also makes some other improvements

this allows us to remove some of the Optional's to
make downstream consumers easier to write. However,
this keeps the return type as a Res (result, with errors),
so downstream consumers will have to handle those in case
the schema ever changes (highly unlikely)

also added the 'call_type/message_type' with a comment
there describing the values

I left 'who' Optional since I believe it actually should be --
it's very possible for there to be no contact name; added
a check in case it's '(Unknown)', which is what my phone
sets it to
2023-04-15 17:17:02 +01:00
seanbreckenridge
40de162fab
cli: add option to output locations to gpx files (#286)
* cli: add option to output locations to gpx files
2023-04-15 00:31:11 +01:00
Sean Breckenridge
02c738594f smscalls: make some fields optional, yield errors
reflects the new types-lxml package
https://github.com/abelcheung/types-lxml
2023-04-14 23:50:26 +01:00
Dima Gerasimov
d464b1e607 core: implement more methods for ZipPath and better support for get_files 2023-04-03 22:58:54 +01:00
Dima Gerasimov
0c5b2b4a09 my.whatsapp.android: initial module 2023-04-01 04:07:35 +01:00
Dima Gerasimov
8288032b1c my.telegram.telegram_backup: support optional extra_where and optional media info extraction for Promnesia 2023-03-27 03:27:13 +01:00
Dima Gerasimov
74710b339a telegram_backup: order messages by date and users/chats by id for determinism 2023-03-27 03:27:13 +01:00
Kian-Meng Ang
d2ef23fcb4 docs: fix typos
found via `codespell -L copie,datas,pres,fo,tooks,noo,ue,ket,frop`
2023-03-27 03:02:35 +01:00
Dima Gerasimov
919c84fb5a my.instagram: better unification of like messages/reactions 2023-03-27 02:16:17 +01:00
Dima Gerasimov
9aadbb504b my.instagram.android: properly extract our own user 2023-03-27 02:16:17 +01:00
Dima Gerasimov
8f7d14e7c6 my.instagram: somewhat mad merging mechanism to correlate gdpr and android exports 2023-03-27 02:16:17 +01:00
Dima Gerasimov
e7be680841 my.instagram.gdpr: handle missing message content defensively 2023-03-27 02:16:17 +01:00
Dima Gerasimov
347cd1ef77 my.fbmessenger: add Sender protocol for consistency 2023-03-17 00:33:22 +00:00
Dima Gerasimov
58d2e25a42 ci: suppress some mypy issues after upgrade 2023-03-17 00:33:22 +00:00
Dima Gerasimov
bef832cbff my.fbmessenger.export: remove legacy dump_chat_history code 2023-03-17 00:33:22 +00:00
Dima Gerasimov
0a05b27266 my.fbmessenger.android: set timezone to utc 2023-03-17 00:33:22 +00:00
Dima Gerasimov
457797bdfb my.bumble.android: better handling for missing conversation id in database 2023-03-17 00:33:22 +00:00
Dima Gerasimov
9db5f318fb my.twitter.twint: use dict row factory instead of sqlite Row
otherwise it's not json serializable
2023-03-17 00:33:22 +00:00
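
The dict row factory mentioned here is a standard sqlite3 pattern, roughly:

```python
import sqlite3


def dict_row(cursor: sqlite3.Cursor, row: tuple) -> dict:
    # plain dicts serialize to JSON directly, unlike sqlite3.Row objects
    return {desc[0]: value for desc, value in zip(cursor.description, row)}


conn = sqlite3.connect('/path/to/twint.db')  # path is illustrative
conn.row_factory = dict_row
```
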
seanbreckenridge
79eeab2128
cli completion doc updates, hide legacy import warning (#279)
* core/cli: hide warnings when autocompleting

* link to completion in setup/troubleshooting
* update completion docs to make source path clear
2023-03-06 21:36:36 +00:00
seanbreckenridge
9d231a8ea9
google_takeout: add semantic location history (#278)
* google_takeout: add semantic location history
2023-03-04 18:36:10 +00:00
Dima Gerasimov
a4c713664e core.logging: sync logging helper with Promnesia, adds more goodies
- print exception traceback by default when using logger.exception
- COLLAPSE_DEBUG_LOGS env variable
2023-03-03 21:14:11 +00:00
Dima Gerasimov
bee17d932b fbmessenger.android: use Optional name, best to leave it to the consumer to decide how to behave when it's unavailable
e.g. using <NAME UNAVAILABLE> was causing issues when used as a zulip contact name
2023-03-03 21:14:11 +00:00
Dima Gerasimov
4dfc4029c3 core.kompress: proper support for read_text/read_bytes against zstd/xz archives 2023-03-03 21:14:11 +00:00
Dima Gerasimov
b94904f5ee core.kompress: support .zst extension, seems more conventional than .zstd 2023-03-03 21:14:11 +00:00
Sean Breckenridge
db2cd00bed try removing parallel on mac to prevent CI failure 2023-02-28 20:55:12 +00:00
Sean Breckenridge
a70118645b my.ip.common: remove REQUIRES
no reason to have it there since it's
__NOT_HPI_MODULE__, so it's not discoverable anyway
2023-02-28 20:55:12 +00:00
Sean Breckenridge
f36bc6144b tox: use my.ip.all, sort hpi installs 2023-02-28 20:55:12 +00:00
Sean Breckenridge
435cb020f9 add example for denylist, update ci 2023-02-28 20:55:12 +00:00
seanbreckenridge
98b086f746
location fallback (#263)
see https://github.com/karlicoss/HPI/issues/262

* move home to fallback/via_home.py
* move via_ip to fallback
* add fallback model
* add stub via_ip file
* add fallback_locations for via_ip
* use protocol for locations
* estimate_from helper, via_home estimator, all.py
* via_home: add accuracy, cache history
* add datasources to gpslogger/google_takeout
* tz/via_location.py: update import to fallback
* denylist docs/installation instructions
* tz.via_location: let user customize cachew refresh time
* add via_ip.estimate_location using binary search
* use estimate_location in via_home.get_location
* tests: add gpslogger to location config stub
* tests: install tz related libs in test env
* tz: add regression test for broken windows dates

* vendorize bisect_left from python src
doesn't have a 'key' parameter until python3.10
2023-02-28 04:30:06 +00:00
Dima Gerasimov
6dc5e7575f vk_messages_backup: add unique_everseen to prevent duplicate messages 2023-02-28 03:55:44 +00:00
Dima Gerasimov
a7099e2efc vk_messages_backup: more correct handling of group chats & better chat ids 2023-02-28 03:55:44 +00:00
Dima Gerasimov
02c98143d5 vk_messages_backup: better structure & extract richer information 2023-02-28 03:55:44 +00:00
266 changed files with 11043 additions and 4909 deletions


@ -21,7 +21,7 @@ import shutil
is_ci = os.environ.get('CI') is not None
def main():
def main() -> None:
import argparse
p = argparse.ArgumentParser()
p.add_argument('--test', action='store_true', help='use test pypi')
@ -29,7 +29,7 @@ def main():
extra = []
if args.test:
extra.extend(['--repository-url', 'https://test.pypi.org/legacy/'])
extra.extend(['--repository', 'testpypi'])
root = Path(__file__).absolute().parent.parent
os.chdir(root) # just in case
@ -42,7 +42,7 @@ def main():
if dist.exists():
shutil.rmtree(dist)
check_call('python3 setup.py sdist bdist_wheel', shell=True)
check_call(['python3', '-m', 'build'])
TP = 'TWINE_PASSWORD'
password = os.environ.get(TP)


@ -11,6 +11,8 @@ if ! command -v sudo; then
}
fi
# --parallel-live to show outputs while it's running
tox_cmd='run-parallel --parallel-live'
if [ -n "${CI-}" ]; then
# install OS specific stuff here
case "$OSTYPE" in
@ -20,7 +22,8 @@ if [ -n "${CI-}" ]; then
;;
cygwin* | msys* | win*)
# windows
:
# ugh. parallel stuff seems super flaky under windows, some random failures, "file used by other process" and crap like that
tox_cmd='run'
;;
*)
# must be linux?
@ -37,5 +40,9 @@ if ! command -v python3 &> /dev/null; then
PY_BIN="python"
fi
"$PY_BIN" -m pip install --user tox
"$PY_BIN" -m tox
# TODO hmm for some reason installing uv with pip and then running
# "$PY_BIN" -m uv tool fails with missing setuptools error??
# just uvx directly works, but it's not present in PATH...
"$PY_BIN" -m pip install --user pipx
"$PY_BIN" -m pipx run uv tool run --with=tox-uv tox $tox_cmd "$@"


@ -5,24 +5,36 @@ on:
push:
branches: '*'
tags: 'v[0-9]+.*' # only trigger on 'release' tags for PyPi
# Note that people who fork it need to go to "Actions" tab on their fork and click "I understand my workflows, go ahead and enable them".
# Ideally I would put this in the pypi job... but github syntax doesn't allow for regexes there :shrug:
pull_request: # needed to trigger on others' PRs
# Note that people who fork it need to go to "Actions" tab on their fork and click "I understand my workflows, go ahead and enable them".
workflow_dispatch: # needed to trigger workflows manually
# todo cron?
inputs:
debug_enabled:
type: boolean
description: 'Run the build with tmate debugging enabled (https://github.com/marketplace/actions/debugging-with-tmate)'
required: false
default: false
jobs:
build:
strategy:
fail-fast: false
matrix:
platform: [ubuntu-latest, macos-latest, windows-latest]
python-version: ['3.7', '3.8', '3.9', '3.10']
python-version: ['3.9', '3.10', '3.11', '3.12', '3.13']
exclude: [
# windows runners are pretty scarce, so let's only run one of them..
{platform: windows-latest, python-version: '3.7' },
{platform: windows-latest, python-version: '3.9' },
# windows runners are pretty scarce, so let's only run lowest and highest python version
{platform: windows-latest, python-version: '3.10'},
{platform: windows-latest, python-version: '3.11'},
{platform: windows-latest, python-version: '3.12'},
# same, macos is a bit too slow and ubuntu covers python quirks well
{platform: macos-latest , python-version: '3.10' },
{platform: macos-latest , python-version: '3.11' },
{platform: macos-latest , python-version: '3.12' },
]
runs-on: ${{ matrix.platform }}
@ -34,29 +46,31 @@ jobs:
# ugh https://github.com/actions/toolkit/blob/main/docs/commands.md#path-manipulation
- run: echo "$HOME/.local/bin" >> $GITHUB_PATH
- uses: actions/setup-python@v3
- uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- uses: actions/checkout@v3
- uses: actions/checkout@v4
with:
submodules: recursive
fetch-depth: 0 # nicer to have all git history when debugging/for tests
# uncomment for SSH debugging
# - uses: mxschmitt/action-tmate@v3
- uses: mxschmitt/action-tmate@v3
if: ${{ github.event_name == 'workflow_dispatch' && inputs.debug_enabled }}
# explicit bash command is necessary for Windows CI runner, otherwise it thinks it's cmd...
- run: bash scripts/ci/run
- run: bash .ci/run
- if: matrix.platform == 'ubuntu-latest' # no need to compute coverage for other platforms
uses: actions/upload-artifact@v3
uses: actions/upload-artifact@v4
with:
include-hidden-files: true
name: .coverage.mypy-misc_${{ matrix.platform }}_${{ matrix.python-version }}
path: .coverage.mypy-misc/
- if: matrix.platform == 'ubuntu-latest' # no need to compute coverage for other platforms
uses: actions/upload-artifact@v3
uses: actions/upload-artifact@v4
with:
include-hidden-files: true
name: .coverage.mypy-core_${{ matrix.platform }}_${{ matrix.python-version }}
path: .coverage.mypy-core/
@ -68,11 +82,11 @@ jobs:
# ugh https://github.com/actions/toolkit/blob/main/docs/commands.md#path-manipulation
- run: echo "$HOME/.local/bin" >> $GITHUB_PATH
- uses: actions/setup-python@v3
- uses: actions/setup-python@v5
with:
python-version: '3.8'
python-version: '3.10'
- uses: actions/checkout@v3
- uses: actions/checkout@v4
with:
submodules: recursive
@ -81,8 +95,7 @@ jobs:
if: github.event_name != 'pull_request' && github.event.ref == 'refs/heads/master'
env:
TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD_TEST }}
run: pip3 install --user wheel twine && scripts/release --test
# TODO run pip install just to test?
run: pip3 install --user --upgrade build twine && .ci/release --test
- name: 'release to pypi'
# always deploy tags to release pypi
@ -90,4 +103,4 @@ jobs:
if: github.event_name != 'pull_request' && startsWith(github.event.ref, 'refs/tags')
env:
TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
run: pip3 install --user wheel twine && scripts/release
run: pip3 install --user --upgrade build twine && .ci/release

.gitignore

@ -12,6 +12,7 @@
auto-save-list
tramp
.\#*
*.gpx
# Org-mode
.org-id-locations
@ -154,6 +155,9 @@ celerybeat-schedule
.dmypy.json
dmypy.json
# linters
.ruff_cache/
# Pyre type checker
.pyre/


@ -17,10 +17,10 @@ General/my.core changes:
- 746c3da0cadcba3b179688783186d8a0bd0999c5 core.pandas: allow specifying schema; add tests
- 5313984d8fea2b6eef6726b7b346c1f4316acd01 add `tmp_config` context manager for test & adhoc patching
- df9a7f7390aee6c69f1abf1c8d1fc7659ebb957c core.pandas: add check for 'error' column + add empty one by default
- e81dddddf083ffd81aa7e2b715bd34f59949479c proprely resolve class properties in make_config + add test
- e81dddddf083ffd81aa7e2b715bd34f59949479c properly resolve class properties in make_config + add test
Modules:
- some innitial work on filling **InfluxDB** with HPI data
- some initial work on filling **InfluxDB** with HPI data
- pinboard
- 42399f6250d9901d93dcedcfe05f7857babcf834: **breaking backwards compatibility**, use pinbexport module directly


@ -531,7 +531,7 @@ If you like the shell or just want to quickly convert/grab some information from
#+begin_src bash
$ hpi query my.coding.commits.commits --stream # stream JSON objects as they're read
--order-type datetime # find the 'datetime' attribute and order by that
--after '2020-01-01 00:00:00' --before '2020-12-31 23:59:59' # in 2020
--after '2020-01-01' --before '2021-01-01' # in 2020
| jq '.committed_dt' -r # extract the datetime
# mangle the output a bit to group by month and graph it
| cut -d'-' -f-2 | sort | uniq -c | awk '{print $2,$1}' | sort -n | termgraph
@ -552,6 +552,8 @@ If you like the shell or just want to quickly convert/grab some information from
2020-12: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 383.00
#+end_src
See [[https://github.com/karlicoss/HPI/blob/master/doc/QUERY.md][query docs]]
for more examples
** Querying Roam Research database
:PROPERTIES:
@ -721,10 +723,10 @@ If you want to write modules for personal use but don't want to merge them into
Other HPI Repositories:
- [[https://github.com/seanbreckenridge/HPI][seanbreckenridge/HPI]]
- [[https://github.com/purarue/HPI][purarue/HPI]]
- [[https://github.com/madelinecameron/hpi][madelinecameron/HPI]]
If you want to create your own to create your own modules/override something here, you can use the [[https://github.com/seanbreckenridge/HPI-template][template]].
If you want to create your own to create your own modules/override something here, you can use the [[https://github.com/purarue/HPI-template][template]].
* Related links
:PROPERTIES:

conftest.py

@ -0,0 +1,47 @@
# this is a hack to monkey patch pytest so it handles tests inside namespace packages without __init__.py properly
# without it, pytest can't discover the package root for some reason
# also see https://github.com/karlicoss/pytest_namespace_pkgs for more
import os
import pathlib
from typing import Optional
import _pytest.main
import _pytest.pathlib
# we consider all dirs in repo/ to be namespace packages
root_dir = pathlib.Path(__file__).absolute().parent.resolve() # / 'src'
assert root_dir.exists(), root_dir
# TODO assert it contains package name?? maybe get it via setuptools..
namespace_pkg_dirs = [str(d) for d in root_dir.iterdir() if d.is_dir()]
# resolve_package_path is called from _pytest.pathlib.import_path
# takes a full abs path to the test file and needs to return the path to the 'root' package on the filesystem
resolve_pkg_path_orig = _pytest.pathlib.resolve_package_path
def resolve_package_path(path: pathlib.Path) -> Optional[pathlib.Path]:
result = path # search from the test file upwards
for parent in result.parents:
if str(parent) in namespace_pkg_dirs:
return parent
if os.name == 'nt':
# ??? for some reason on windows it is trying to call this against conftest? but not on linux/osx
if path.name == 'conftest.py':
return resolve_pkg_path_orig(path)
raise RuntimeError("Couldn't determine path for ", path)
_pytest.pathlib.resolve_package_path = resolve_package_path
# without patching, the orig function returns just a package name for some reason
# (I think it's used as a sort of fallback)
# so we need to point it at the absolute path properly
# not sure what are the consequences.. maybe it wouldn't be able to run against installed packages? not sure..
search_pypath_orig = _pytest.main.search_pypath
def search_pypath(module_name: str) -> str:
mpath = root_dir / module_name.replace('.', os.sep)
if not mpath.is_dir():
mpath = mpath.with_suffix('.py')
assert mpath.exists(), mpath # just in case
return str(mpath)
_pytest.main.search_pypath = search_pypath

demo.py

@ -1,6 +1,6 @@
#!/usr/bin/env python3
from subprocess import check_call, DEVNULL
from shutil import copy, copytree
from shutil import copytree, ignore_patterns
import os
from os.path import abspath
from sys import executable as python
@ -9,12 +9,17 @@ from pathlib import Path
my_repo = Path(__file__).absolute().parent
def run():
def run() -> None:
# uses fixed paths; worth it for the sake of demonstration
# assumes we're in /tmp/my_demo now
# 1. clone git@github.com:karlicoss/my.git
copytree(my_repo, 'my_repo', symlinks=True)
copytree(
my_repo,
'my_repo',
symlinks=True,
ignore=ignore_patterns('.tox*'), # tox dir might have broken symlinks while tests are running in parallel
)
# 2. prepare repositories you'd be using. For this demo we only set up Hypothesis
tox = 'TOX' in os.environ

doc/DENYLIST.md

@ -0,0 +1,130 @@
For code reference, see: [`my.core.denylist.py`](../my/core/denylist.py)
A helper module for defining denylists for sources programmatically (in layman's terms, this lets you remove some particular output from a module you don't want)
Lets you specify a class, an attribute to match on,
and a JSON file containing a list of values to deny/filter out
As an example, this will use the `my.ip` module, as filtering incorrect IPs was the original use case for this module:
```python
class IP(NamedTuple):
addr: str
dt: datetime
```
A possible denylist file would contain:
```json
[
{
"addr": "192.168.1.1",
},
{
"dt": "2020-06-02T03:12:00+00:00",
}
]
```
Note that if the value being compared to is not a single (non-array/object) JSON primitive
(str, int, float, bool, None), it will be converted to a string before comparison
To use this in code:
```python
from my.ip.all import ips
filtered = DenyList("~/data/ip_denylist.json").filter(ips())
```
To add items to the denylist, in python (in a one-off script):
```python
from my.ip.all import ips
from my.core.denylist import DenyList
d = DenyList("~/data/ip_denylist.json")
for ip in ips():
# some custom code you define
if ip.addr == ...:
d.deny(key="addr", value=ip.addr)
d.write()
```
... or interactively, which requires [`fzf`](https://github.com/junegunn/fzf) and [`pyfzf-iter`](https://pypi.org/project/pyfzf-iter/) (`python3 -m pip install pyfzf-iter`) to be installed:
```python
from my.ip.all import ips
from my.core.denylist import DenyList
d = DenyList("~/data/ip_denylist.json")
d.deny_cli(ips()) # automatically writes after each selection
```
That will open up an interactive `fzf` prompt, where you can select an item to add to the denylist
This is meant for relatively simple filters, where you want to filter items out
based on a single attribute of a namedtuple/dataclass. If you want to do something
more complex, I would recommend overriding the `all.py` file for that source and
writing your own filter function there.
For more info on all.py:
https://github.com/karlicoss/HPI/blob/master/doc/MODULE_DESIGN.org#allpy
This would typically be used in an overridden `all.py` file, or in a one-off script
which you may want to filter out some items from a source, progressively adding more
items to the denylist as you go.
A potential `my/ip/all.py` file might look like (Sidenote: `discord` module from [here](https://github.com/purarue/HPI)):
```python
from typing import Iterator
from my.ip.common import IP
from my.core.denylist import DenyList
deny = DenyList("~/data/ip_denylist.json")
# all possible data from the source
def _ips() -> Iterator[IP]:
from my.ip import discord
# could add other imports here
yield from discord.ips()
# filtered data
def ips() -> Iterator[IP]:
yield from deny.filter(_ips())
```
To add items to the denylist, you could create a `__main__.py` in your namespace package (in this case, `my/ip/__main__.py`), with contents like:
```python
from my.ip import all
if __name__ == "__main__":
all.deny.deny_cli(all.ips())
```
Which could then be called like: `python3 -m my.ip`
Or, you could just run it from the command line:
```
python3 -c 'from my.ip import all; all.deny.deny_cli(all.ips())'
```
To edit the `all.py`, you could either:
- install it as editable (`python3 -m pip install --user -e ./HPI`), and then edit the file directly
- or, create a namespace package, which splits the package across multiple directories. For info on that see [`MODULE_DESIGN`](https://github.com/karlicoss/HPI/blob/master/doc/MODULE_DESIGN.org#namespace-packages), [`reorder_editable`](https://github.com/purarue/reorder_editable), and possibly the [`HPI-template`](https://github.com/purarue/HPI-template) to create your own HPI namespace package to create your own `all.py` file.
For a real example of this see, [purarue/HPI-personal](https://github.com/purarue/HPI-personal/blob/master/my/ip/all.py)
Sidenote: the reason why we want to specifically override
the all.py and not just create a script that filters out the items you're
not interested in is because we want to be able to import from `my.ip.all`
or `my.location.all` from other modules and get the filtered results, without
having to mix data filtering logic with parsing/loading/caching (the stuff HPI does)


@ -4,7 +4,7 @@ note: this doc is in progress
- interoperable
# note: this link doesnt work in org, but does for the github preview
# note: this link doesn't work in org, but does for the github preview
This is the main motivation and [[file:../README.org#why][why]] I created HPI in the first place.
Ideally it should be possible to hook into anything you can imagine -- regardless the database/programming language/etc.


@ -76,7 +76,7 @@ The config snippets below are meant to be modified accordingly and *pasted into
You don't have to set up all modules at once, it's recommended to do it gradually, to get the feel of how HPI works.
For an extensive/complex example, you can check out ~@seanbreckenridge~'s [[https://github.com/seanbreckenridge/dotfiles/blob/master/.config/my/my/config/__init__.py][config]]
For an extensive/complex example, you can check out ~@purarue~'s [[https://github.com/purarue/dotfiles/blob/master/.config/my/my/config/__init__.py][config]]
# Nested Configurations before the doc generation using the block below
** [[file:../my/reddit][my.reddit]]
@ -96,7 +96,7 @@ For an extensive/complex example, you can check out ~@seanbreckenridge~'s [[http
class pushshift:
'''
Uses [[https://github.com/seanbreckenridge/pushshift_comment_export][pushshift]] to get access to old comments
Uses [[https://github.com/purarue/pushshift_comment_export][pushshift]] to get access to old comments
'''
# path[s]/glob to the exported JSON data
@ -106,7 +106,7 @@ For an extensive/complex example, you can check out ~@seanbreckenridge~'s [[http
** [[file:../my/browser/][my.browser]]
Parses browser history using [[http://github.com/seanbreckenridge/browserexport][browserexport]]
Parses browser history using [[http://github.com/purarue/browserexport][browserexport]]
#+begin_src python
class browser:
@ -132,7 +132,7 @@ For an extensive/complex example, you can check out ~@seanbreckenridge~'s [[http
You might also be able to use [[file:../my/location/via_ip.py][my.location.via_ip]] which uses =my.ip.all= to
provide geolocation data for an IPs (though no IPs are provided from any
of the sources here). For an example of usage, see [[https://github.com/seanbreckenridge/HPI/tree/master/my/ip][here]]
of the sources here). For an example of usage, see [[https://github.com/purarue/HPI/tree/master/my/ip][here]]
#+begin_src python
class location:
@ -256,9 +256,9 @@ for cls, p in modules:
** [[file:../my/google/takeout/parser.py][my.google.takeout.parser]]
Parses Google Takeout using [[https://github.com/seanbreckenridge/google_takeout_parser][google_takeout_parser]]
Parses Google Takeout using [[https://github.com/purarue/google_takeout_parser][google_takeout_parser]]
See [[https://github.com/seanbreckenridge/google_takeout_parser][google_takeout_parser]] for more information about how to export and organize your takeouts
See [[https://github.com/purarue/google_takeout_parser][google_takeout_parser]] for more information about how to export and organize your takeouts
If the =DISABLE_TAKEOUT_CACHE= environment variable is set, this won't
cache individual exports in =~/.cache/google_takeout_parser=


@ -2,6 +2,19 @@ Some thoughts on modules, how to structure them, and adding your own/extending H
This is slightly more advanced, and would be useful if you're trying to extend HPI by developing your own modules, or contributing back to HPI
* TOC
:PROPERTIES:
:TOC: :include all :depth 1 :force (nothing) :ignore (this) :local (nothing)
:END:
:CONTENTS:
- [[#allpy][all.py]]
- [[#module-count][module count]]
- [[#single-file-modules][single file modules]]
- [[#adding-new-modules][Adding new modules]]
- [[#an-extendable-module-structure][An Extendable module structure]]
- [[#logging-guidelines][Logging guidelines]]
:END:
* all.py
Some modules have lots of different sources for data. For example, ~my.location~ (location data) has lots of possible sources -- from ~my.google.takeout.parser~, using the ~gpslogger~ android app, or through geo locating ~my.ip~ addresses. For a module with multiple possible sources, its common to split it into files like:
@ -54,7 +67,7 @@ If you want to disable a source, you have a few options.
... that suppresses the warning message and lets you use ~my.location.all~ without having to change any lines of code
Another benefit is that all the custom sources/data is localized to the ~all.py~ file, so a user can override the ~all.py~ (see the sections below on ~namespace packages~) file in their own HPI repository, adding additional sources without having to maintain a fork and patching in changes as things eventually change. For a 'real world' example of that, see [[https://github.com/seanbreckenridge/HPI#partially-in-usewith-overrides][seanbreckenridge]]s location and ip modules.
Another benefit is that all the custom sources/data is localized to the ~all.py~ file, so a user can override the ~all.py~ (see the sections below on ~namespace packages~) file in their own HPI repository, adding additional sources without having to maintain a fork and patching in changes as things eventually change. For a 'real world' example of that, see [[https://github.com/purarue/HPI#partially-in-usewith-overrides][purarue]]s location and ip modules.
This is of course not required for personal or single file modules, its just the pattern that seems to have the least amount of friction for the user, while being extendable, and without using a bulky plugin system to let users add additional sources.
@ -113,7 +126,7 @@ Not all HPI Modules are currently at that level of complexity -- some are simple
A related concern is how to structure namespace packages to allow users to easily extend them, and how this conflicts with single file modules (Keep reading below for more information on namespace packages/extension) If a module is converted from a single file module to a namespace with multiple files, it seems this is a breaking change, see [[https://github.com/karlicoss/HPI/issues/89][#89]] for an example of this. The current workaround is to leave it a regular python package with an =__init__.py= for some amount of time and send a deprecation warning, and then eventually remove the =__init__.py= file to convert it into a namespace package. For an example, see the [[https://github.com/karlicoss/HPI/blob/8422c6e420f5e274bd1da91710663be6429c666c/my/reddit/__init__.py][reddit init file]].
Its quite a pain to have to convert a file from a single file module to a namespace module, so if theres *any* possibility that you might convert it to a namespace package, might as well just start it off as one, to avoid the pain down the road. As an example, say you were creating something to parse ~zsh~ history. Instead of creating ~my/zsh.py~, it would be better to create ~my/zsh/parser.py~. That lets users override the file using editable/namespace packages, and it also means in the future its much more trivial to extend it to something like:
It's quite a pain to have to convert a file from a single file module to a namespace module, so if there's *any* possibility that you might convert it to a namespace package, might as well just start it off as one, to avoid the pain down the road. As an example, say you were creating something to parse ~zsh~ history. Instead of creating ~my/zsh.py~, it would be better to create ~my/zsh/parser.py~. That lets users override the file using editable/namespace packages, and it also means in the future it's much more trivial to extend it to something like:
#+begin_src
my/zsh
@ -161,7 +174,7 @@ There's no requirement to follow this entire structure when you start off, the e
Note: this section covers some of the complexities and benefits with this being a namespace package and/or editable install, so it assumes some familiarity with python/imports
HPI is installed as a namespace package, which allows an additional way to add your own modules. For the details on namespace packges, see [[https://www.python.org/dev/peps/pep-0420/][PEP420]], or the [[https://packaging.python.org/guides/packaging-namespace-packages][packaging docs for a summary]], but for our use case, a sufficient description might be: Namespace packages let you split a package across multiple directories on disk.
HPI is installed as a namespace package, which allows an additional way to add your own modules. For the details on namespace packages, see [[https://www.python.org/dev/peps/pep-0420/][PEP420]], or the [[https://packaging.python.org/guides/packaging-namespace-packages][packaging docs for a summary]], but for our use case, a sufficient description might be: Namespace packages let you split a package across multiple directories on disk.
Without adding a bulky/boilerplate-y plugin framework to HPI, as that increases the barrier to entry, [[https://packaging.python.org/guides/creating-and-discovering-plugins/#using-namespace-packages][namespace packages offer an alternative]] with few downsides.
@ -195,13 +208,13 @@ Where ~lastfm.py~ is your version of ~my.lastfm~, which you've copied from this
Then, running ~python3 -m pip install -e .~ in that directory would install that as part of the namespace package, and assuming (see below for possible issues) this appears on ~sys.path~ before the upstream repository, your ~lastfm.py~ file overrides the upstream. Adding more files, like ~my.some_new_module~ into that directory immediately updates the global ~my~ package -- allowing you to quickly add new modules without having to re-install.
If you install both directories as editable packages (which has the benefit of any changes you making in either repository immediately updating the globally installed ~my~ package), there are some concerns with which editable install appears on your ~sys.path~ first. If you wanted your modules to override the upstream modules, yours would have to appear on the ~sys.path~ first (this is the same reason that =custom_lastfm_overlay= must be at the front of your ~PYTHONPATH~). For more details and examples on dealing with editable namespace packages in the context of HPI, see the [[https://github.com/seanbreckenridge/reorder_editable][reorder_editable]] repository.
If you install both directories as editable packages (which has the benefit of any changes you make in either repository immediately updating the globally installed ~my~ package), there are some concerns with which editable install appears on your ~sys.path~ first. If you wanted your modules to override the upstream modules, yours would have to appear on the ~sys.path~ first (this is the same reason that =custom_lastfm_overlay= must be at the front of your ~PYTHONPATH~). For more details and examples on dealing with editable namespace packages in the context of HPI, see the [[https://github.com/purarue/reorder_editable][reorder_editable]] repository.
There is no limit to how many directories you could install into a single namespace package, which could be a possible way for people to install additional HPI modules, without worrying about the module count here becoming too large to manage.
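To check which directories end up participating in the ~my~ namespace package (and in which order they're resolved), a quick sanity check from Python (a minimal sketch; the example output is made up):
#+begin_src python
import my

# for a namespace package, __path__ lists every directory contributing to it,
# in resolution order -- the first entry wins for any overridden module
print(list(my.__path__))
# e.g. ['/home/user/my-hpi-overrides/my', '/home/user/.local/lib/python3.10/site-packages/my']
#+end_src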
There are some other users [[https://github.com/hpi/hpi][who have begun publishing their own modules]] as namespace packages, which you could potentially install and use, in addition to this repository, if any of those interest you. If you want to create your own you can use the [[https://github.com/seanbreckenridge/HPI-template][template]] to get started.
There are some other users [[https://github.com/hpi/hpi][who have begun publishing their own modules]] as namespace packages, which you could potentially install and use, in addition to this repository, if any of those interest you. If you want to create your own you can use the [[https://github.com/purarue/HPI-template][template]] to get started.
Though, enabling this many modules may make ~hpi doctor~ look pretty busy. You can explicitly choose to enable/disable modules with a list of modules/regexes in your [[https://github.com/karlicoss/HPI/blob/f559e7cb899107538e6c6bbcf7576780604697ef/my/core/core_config.py#L24-L55][core config]], see [[https://github.com/seanbreckenridge/dotfiles/blob/a1a77c581de31bd55a6af3d11b8af588614a207e/.config/my/my/config/__init__.py#L42-L72][here]] for an example.
Though, enabling this many modules may make ~hpi doctor~ look pretty busy. You can explicitly choose to enable/disable modules with a list of modules/regexes in your [[https://github.com/karlicoss/HPI/blob/f559e7cb899107538e6c6bbcf7576780604697ef/my/core/core_config.py#L24-L55][core config]], see [[https://github.com/purarue/dotfiles/blob/a1a77c581de31bd55a6af3d11b8af588614a207e/.config/my/my/config/__init__.py#L42-L72][here]] for an example.
You may use the other modules or [[https://github.com/karlicoss/hpi-personal-overlay][my overlay]] as reference, but python packaging is already a complicated issue, before adding complexities like namespace packages and editable installs on top of it... If you're having trouble extending HPI in this fashion, you can open an issue here, preferably with a link to your code/repository and/or ~setup.py~ you're trying to use.
@ -226,11 +239,93 @@ The main goals are:
- doesn't require you to maintain a fork of this repository, though you can maintain a separate HPI repository (so no patching/merge conflicts)
- allows you to easily add/remove sources to the ~all.py~ module, either by:
- overriding an ~all.py~ in your own repository
- just commenting out the source/adding 2 lines to import and ~yield
from~ your new source
- just commenting out the source/adding 2 lines to import and ~yield from~ your new source
- doing nothing! (~import_source~ will catch the error and just warn you
and continue to work without changing any code)
It could be argued that namespace packages and editable installs are a bit complex for a new user to get the hang of, and this is true. But fortunately ~import_source~ means any user just using HPI only needs to follow the instructions when a warning is printed, or peruse the docs here a bit -- there's no need to clone or create your own override to just use the ~all.py~ file.
There's no requirement to use this for individual modules, it just seems to be the best solution we've arrived at so far
* Logging guidelines
HPI doesn't enforce any specific logging mechanism, you're free to use whatever you prefer in your modules.
However there are some general guidelines for developing modules that can make them more pleasant to use.
- each module should have its own unique logger; the easiest way to ensure that is to simply use the module's ~__name__~ attribute as the logger name
In addition, this ensures the logger hierarchy reflects the package hierarchy.
For instance, if you initialize the logger for =my.module= with specific settings, the logger for =my.module.helper= would inherit these settings. See more on that [[https://docs.python.org/3/library/logging.html?highlight=logging#logger-objects][in python docs]].
As a bonus, if you use the module ~__name__~, this logger will automatically be picked up and used by ~cachew~.
- often modules are processing multiple files, extracting data from each one ([[https://beepb00p.xyz/exports.html#types][incremental/synthetic exports]])
It's nice to log each file name you're processing as =logger.info= so the user of the module gets a sense of progress.
If possible, add the index of the file you're processing and the total count.
#+begin_src python
def process_all_data():
    paths = inputs()
    total = len(paths)
    width = len(str(total))
    for idx, path in enumerate(paths):
        # :>{width} to align the logs vertically
        logger.info(f'processing [{idx:>{width}}/{total:>{width}}] {path}')
        yield from process_path(path)
#+end_src
If there is a lot of logging happening related to a specific path, instead of adding path to each logging message manually, consider using [[https://docs.python.org/3/library/logging.html?highlight=loggeradapter#logging.LoggerAdapter][LoggerAdapter]].
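For example, a minimal sketch using only the standard library (~logger~ and ~path~ here refer to the variables from the loop above):
#+begin_src python
import logging


class PathAdapter(logging.LoggerAdapter):
    # prefix every message with the path stored in the adapter's `extra` dict
    def process(self, msg, kwargs):
        return f"[{self.extra['path']}] {msg}", kwargs


# inside the per-file loop from the previous example
path_logger = PathAdapter(logger, {'path': str(path)})
path_logger.debug('found 3 tables')
path_logger.warning('timestamp column is missing')
#+end_src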
- log exceptions, but sparingly
Generally it's a good practice to call ~logging.exception~ from the ~except~ clause, so it's immediately visible where the errors are happening.
However, in HPI, instead of crashing on exceptions we often behave defensively and ~yield~ them instead (see [[https://beepb00p.xyz/mypy-error-handling.html][mypy assisted error handling]]).
In this case logging every time may become a bit spammy, so use exception logging sparingly (see the sketch below).
Typically it's best to rely on the downstream data consumer to handle the exceptions properly.
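A minimal sketch of that pattern (~inputs~ and ~parse_file~ here are hypothetical helpers):
#+begin_src python
from collections.abc import Iterator

from my.core import Res  # Res[T] is either a T or an Exception


def messages() -> Iterator[Res[str]]:
    for path in inputs():  # hypothetical: the export files to process
        try:
            yield from parse_file(path)  # hypothetical parser, may raise
        except Exception as e:
            # yield the error instead of raising, so one broken file doesn't
            # kill the whole stream; the downstream consumer decides what to do
            yield e
#+end_src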
- instead of =logging.getLogger=, it's best to use =my.core.make_logger=
#+begin_src python
from my.core import make_logger
logger = make_logger(__name__)
# or to set a custom level
logger = make_logger(__name__, level='warning')
#+end_src
This sets up some nicer defaults over the standard =logging= module:
- colored logs (via =colorlog= library)
- =INFO= as the initial logging level (instead of default =ERROR=)
logging the full exception trace even when logging outside of an exception handler
This is particularly useful for [[https://beepb00p.xyz/mypy-error-handling.html][mypy assisted error handling]].
By default, =logging= only logs the exception message (without the trace) in this case, which makes errors harder to debug.
- control logging level from the shell via ~LOGGING_LEVEL_*~ env variable
This can be useful to suppress logging output if it's too spammy, or showing more output for debugging.
E.g. ~LOGGING_LEVEL_my_instagram_gdpr=DEBUG hpi query my.instagram.gdpr.messages~
- experimental: passing env variable ~LOGGING_COLLAPSE=<loglevel>~ will "collapse" logging with the same level
Instead of printing a new logging line each time, it will 'redraw' the last logged line with a new logging message.
This can be convenient if there are too many logs and you just need logging to get a sense of progress.
- experimental: passing env variable ~ENLIGHTEN_ENABLE=yes~ will display TUI progress bars in some cases
See [[https://github.com/Rockhopper-Technologies/enlighten#readme][https://github.com/Rockhopper-Technologies/enlighten#readme]]
This can be convenient for showing the progress of parallel processing of different files from HPI:
#+BEGIN_EXAMPLE
ghexport.dal[111] 29%|████████████████████ | 29/100 [00:03<00:07, 10.03 files/s]
rexport.dal[comments] 17%|████████ | 115/682 [00:03<00:14, 39.15 files/s]
my.instagram.android 0%|▎ | 3/2631 [00:02<34:50, 1.26 files/s]
#+END_EXAMPLE

322
doc/OVERLAYS.org Normal file
View file

@ -0,0 +1,322 @@
NOTE this kinda overlaps with [[file:MODULE_DESIGN.org][the module design doc]], should be unified in the future.
Relevant discussion about overlays: https://github.com/karlicoss/HPI/issues/102
# This is describing TODO
# TODO goals
# - overrides
# - proper mypy support
# - TODO reusing parent modules?
# You can see them TODO in overlays dir
Consider a toy package/module structure with minimal code, without any actual data parsing, just for demonstration purposes.
- =main= package structure
# TODO do links
- =my/twitter/gdpr.py=
Extracts Twitter data from GDPR archive.
- =my/twitter/all.py=
Merges twitter data from multiple sources (only =gdpr= in this case), so data consumers are agnostic of specific data sources used.
This will be overridden by =overlay=.
- =my/twitter/common.py=
Contains helper function to merge data, so they can be reused by overlay's =all.py=.
- =my/reddit.py=
Extracts Reddit data -- this won't be overridden by the overlay, we just keep it for demonstration purposes.
- =overlay= package structure
- =my/twitter/talon.py=
Extracts Twitter data from Talon android app.
- =my/twitter/all.py=
Override for =all.py= from =main= package -- it merges together data from =gdpr= and =talon= modules.
# TODO mention resolution? reorder_editable
* Installing (editable install)
NOTE: this was tested with =python 3.10= and =pip 23.3.2=.
To install, we run:
: pip3 install --user -e overlay/
: pip3 install --user -e main/
# TODO mention non-editable installs (this bit will still work with non-editable install)
As a result, we get:
: pip3 list | grep hpi
: hpi-main 0.0.0 /project/main/src
: hpi-overlay 0.0.0 /project/overlay/src
: cat ~/.local/lib/python3.10/site-packages/easy-install.pth
: /project/overlay/src
: /project/main/src
(the order above is important, so =overlay= takes precedence over =main= TODO link)
Verify the setup:
: $ python3 -c 'import my; print(my.__path__)'
: _NamespacePath(['/project/overlay/src/my', '/project/main/src/my'])
This basically means that modules will be searched in both paths, with overlay taking precedence.
** Installing with =--use-pep517=
See here for discussion https://github.com/purarue/reorder_editable/issues/2, but TLDR it should work similarly.
* Testing runtime behaviour (editable install)
: $ python3 -c 'import my.reddit as R; print(R.upvotes())'
: [main] my.reddit hello
: ['reddit upvote1', 'reddit upvote2']
Just as expected here, =my.reddit= is imported from the =main= package, since it doesn't exist in =overlay=.
Let's check twitter now:
: $ python3 -c 'import my.twitter.all as T; print(T.tweets())'
: [overlay] my.twitter.all hello
: [main] my.twitter.common hello
: [main] my.twitter.gdpr hello
: [overlay] my.twitter.talon hello
: ['gdpr tweet 1', 'gdpr tweet 2', 'talon tweet 1', 'talon tweet 2']
As expected, =my.twitter.all= was imported from the =overlay=.
As you can see it's merged data from =gdpr= (from =main= package) and =talon= (from =overlay= package).
So far so good, let's see how it works with mypy.
* Mypy support (editable install)
To check that mypy works as expected I injected some statements in modules that have no impact on runtime,
but should trigger mypy, like this =trigger_mypy_error: str = 123=:
Let's run it:
: $ mypy --namespace-packages --strict -p my
: overlay/src/my/twitter/talon.py:9: error: Incompatible types in assignment (expression has type "int", variable has type "str")
: [assignment]
: trigger_mypy_error: str = 123
: ^
: Found 1 error in 1 file (checked 4 source files)
Hmm, this did find the statement in the =overlay=, but missed everything from =main= (e.g. =reddit.py= and =gdpr.py= should have also triggered the check).
First, let's check which sources mypy is processing:
: $ mypy --namespace-packages --strict -p my -v 2>&1 | grep BuildSource
: LOG: Found source: BuildSource(path='/project/overlay/src/my', module='my', has_text=False, base_dir=None)
: LOG: Found source: BuildSource(path='/project/overlay/src/my/twitter', module='my.twitter', has_text=False, base_dir=None)
: LOG: Found source: BuildSource(path='/project/overlay/src/my/twitter/all.py', module='my.twitter.all', has_text=False, base_dir=None)
: LOG: Found source: BuildSource(path='/project/overlay/src/my/twitter/talon.py', module='my.twitter.talon', has_text=False, base_dir=None)
So seems like mypy is not processing anything from =main= package at all?
At this point I cloned mypy, put a breakpoint, and found out this is the culprit: https://github.com/python/mypy/blob/1dd8e7fe654991b01bd80ef7f1f675d9e3910c3a/mypy/modulefinder.py#L288
This basically returns the first path where it finds =my= package, which happens to be the overlay in this case.
So everything else is ignored?
It even seems to have a test for a similar use case, which is quite sad.
https://github.com/python/mypy/blob/1dd8e7fe654991b01bd80ef7f1f675d9e3910c3a/mypy/test/testmodulefinder.py#L64-L71
For now, I opened an issue in mypy repository https://github.com/python/mypy/issues/16683
But ok, maybe mypy treats =main= as an external package somehow but still type checks it properly?
Let's see what's going on with imports:
: $ mypy --namespace-packages --strict -p my --follow-imports=error
: overlay/src/my/twitter/talon.py:9: error: Incompatible types in assignment (expression has type "int", variable has type "str")
: [assignment]
: trigger_mypy_error: str = 123
: ^
: overlay/src/my/twitter/all.py:3: error: Import of "my.twitter.common" ignored [misc]
: from .common import merge
: ^
: overlay/src/my/twitter/all.py:6: error: Import of "my.twitter.gdpr" ignored [misc]
: from . import gdpr
: ^
: overlay/src/my/twitter/all.py:6: note: (Using --follow-imports=error, module not passed on command line)
: overlay/src/my/twitter/all.py: note: In function "tweets":
: overlay/src/my/twitter/all.py:8: error: Returning Any from function declared to return "List[str]" [no-any-return]
: return merge(gdpr, talon)
: ^
: Found 4 errors in 2 files (checked 4 source files)
Nope -- looks like it's completely unaware of =main=, and what's worse, by default (without tweaking =--follow-imports=), these errors would be suppressed.
What if we check =my.twitter= directly?
: $ mypy --namespace-packages --strict -p my.twitter --follow-imports=error
: overlay/src/my/twitter/talon.py:9: error: Incompatible types in assignment (expression has type "int", variable has type "str")
: [assignment]
: trigger_mypy_error: str = 123
: ^~~
: overlay/src/my/twitter: error: Ancestor package "my" ignored [misc]
: overlay/src/my/twitter: note: (Using --follow-imports=error, submodule passed on command line)
: overlay/src/my/twitter/all.py:3: error: Import of "my.twitter.common" ignored [misc]
: from .common import merge
: ^
: overlay/src/my/twitter/all.py:3: note: (Using --follow-imports=error, module not passed on command line)
: overlay/src/my/twitter/all.py:6: error: Import of "my.twitter.gdpr" ignored [misc]
: from . import gdpr
: ^
: overlay/src/my/twitter/all.py: note: In function "tweets":
: overlay/src/my/twitter/all.py:8: error: Returning Any from function declared to return "list[str]" [no-any-return]
: return merge(gdpr, talon)
: ^~~~~~~~~~~~~~~~~~~~~~~~~
: Found 5 errors in 3 files (checked 3 source files)
Now we're also getting =error: Ancestor package "my" ignored [misc]= .. not ideal.
* What if we don't install at all?
Instead of an editable install, let's try running mypy directly over the source files
First let's only check =main= package:
: $ MYPYPATH=main/src mypy --namespace-packages --strict -p my
: main/src/my/twitter/gdpr.py:9: error: Incompatible types in assignment (expression has type "int", variable has type "str") [assignment]
: trigger_mypy_error: str = 123
: ^~~
: main/src/my/reddit.py:11: error: Incompatible types in assignment (expression has type "int", variable has type "str") [assignment]
: trigger_mypy_error: str = 123
: ^~~
: Found 2 errors in 2 files (checked 6 source files)
As expected, it found both errors.
Now with overlay as well:
: $ MYPYPATH=overlay/src:main/src mypy --namespace-packages --strict -p my
: overlay/src/my/twitter/all.py:6: note: In module imported here:
: main/src/my/twitter/gdpr.py:9: error: Incompatible types in assignment (expression has type "int", variable has type "str") [assignment]
: trigger_mypy_error: str = 123
: ^~~
: overlay/src/my/twitter/talon.py:9: error: Incompatible types in assignment (expression has type "int", variable has type "str")
: [assignment]
: trigger_mypy_error: str = 123
: ^~~
: Found 2 errors in 2 files (checked 4 source files)
Interestingly enough, this is slightly better than the editable install (it detected the error in =gdpr.py= as well).
But still no =reddit.py= error.
TODO possibly worth submitting to mypy issue tracker as well...
Overall it seems that properly type checking HPI setup as a whole is kinda problematic, especially if the modules actually override/extend base modules.
* Modifying (monkey patching) original module in the overlay
Let's say we want to modify/monkey patch the =my.twitter.gdpr= module from =main=, for example, convert "gdpr" to uppercase, i.e. =tweet.replace('gdpr', 'GDPR')=.
# TODO see overlay2/
I think our options are:
- symlink to the 'parent' packages, e.g. =main= in this case
Alternatively, somehow install =main= under a different name/alias (managed by pip).
This is discussed here: https://github.com/karlicoss/HPI/issues/102
The main upside is that it's relatively simple and (sort of) works with mypy.
There are a few big downsides:
- creates a parallel package hierarchy (to the one maintained by pip), symlinks will need to be carefully managed manually
This may not be such a huge deal if you don't have too many overlays.
However this results in problems if you're trying to switch between two different HPI checkouts (e.g. stable and development). If you have symlinks into "stable" from the overlay then stable modules will sometimes be picked up when you're expecting "development" package.
- symlinks pointing outside of the source tree might cause pip install to go into infinite loop
- it modifies the package name
This may potentially result in some confusing behaviours.
One thing I noticed for example is that cachew caches might get duplicated.
- it might not work in all cases or might result in recursive imports
- do not shadow the original module
Basically instead of shadowing via namespace package mechanism and creating identically named module,
create some sort of hook that would patch the original =my.twitter.gdpr= module from =main=.
The downside is that it's a bit unclear where to do that, we need some sort of entry point?
- it could be some global dynamic hook defined in the overlay, and then executed from =my.core=
However, it's a bit intrusive, and it's unclear how to handle errors. E.g. what if we're monkey patching a module that we weren't intending to use, don't have its dependencies installed, and it's crashing?
Perhaps core could support something like =_hook= in each of HPI's modules?
Note that it can't be =my.twitter.all=, since we might want to override =.all= itself.
The downside is that this is probably not going to work well with =tmp_config= and such -- we'll need to somehow execute the hook again on reloading the module?
- ideally we'd have something that integrates with =importlib= and is executed automatically when the module is imported?
TODO explore these:
- https://stackoverflow.com/questions/43571737/how-to-implement-an-import-hook-that-can-modify-the-source-code-on-the-fly-using
- https://github.com/brettlangdon/importhook
This one is pretty intrusive, and has some issues, e.g. https://github.com/brettlangdon/importhook/issues/4
Let's try it:
: $ PYTHONPATH=overlay3/src:main/src python3 -c 'import my.twitter._hook; import my.twitter.all as M; print(M.tweets())'
: [main] my.twitter.all hello
: [main] my.twitter.common hello
: [main] my.twitter.gdpr hello
: EXECUTING IMPORT HOOK!
: ['GDPR tweet 1', 'GDPR tweet 2']
Ok it worked, and seems pretty neat.
However sadly it doesn't work with =tmp_config= (TODO add a proper demo?)
Not sure if it's more of an issue with =tmp_config= implementation (which is very hacky), or =importhook= itself?
In addition, still the question is where to put the hook itself, but in that case even a global one could be fine.
- define hook in =my/twitter/__init__.py=
Basically, use =extend_path= to make it behave like a namespace package, but in addition, patch original =my.twitter.talon=?
: $ cat overlay2/src/my/twitter/__init__.py
: print(f'[overlay2] {__name__} hello')
:
: from pkgutil import extend_path
: __path__ = extend_path(__path__, __name__)
:
: def hack_gdpr_module() -> None:
:     from . import gdpr
:     tweets_orig = gdpr.tweets
:     def tweets_patched():
:         return [t.replace('gdpr', 'GDPR') for t in tweets_orig()]
:     gdpr.tweets = tweets_patched
:
: hack_gdpr_module()
This actually seems to work??
: PYTHONPATH=overlay2/src:main/src python3 -c 'import my.twitter.all as M; print(M.tweets())'
: [overlay2] my.twitter hello
: [main] my.twitter.gdpr hello
: [main] my.twitter.all hello
: [main] my.twitter.common hello
: ['GDPR tweet 1', 'GDPR tweet 2']
However, this doesn't stack, i.e. if the 'parent' overlay had its own =__init__.py=, it won't get called.
- shadow the original module and temporarily modify =__path__= before importing the same module from the parent overlay
This approach is implemented in =my.core.experimental.import_original_module=
TODO demonstrate it properly, but I think that also works in a 'chain' of overlays
Seems like that option is the most promising so far, albeit very hacky.
Note that none of these options work well with mypy (since it's all dynamic hackery), even if you disregard the issues described in the previous sections.
# TODO .pkg files? somewhat interesting... https://github.com/python/cpython/blob/3.12/Lib/pkgutil.py#L395-L410

304
doc/QUERY.md Normal file
View file

@ -0,0 +1,304 @@
`hpi query` is a command line tool for querying the output of any `hpi` function.
```
Usage: hpi query [OPTIONS] FUNCTION_NAME...
This allows you to query the results from one or more functions in HPI
By default this runs with '-o json', converting the results to JSON and
printing them to STDOUT
You can specify '-o pprint' to just print the objects using their repr, or
'-o repl' to drop into a ipython shell with access to the results
While filtering using --order-key datetime, the --after, --before and
--within flags parse the input to their datetime and timedelta equivalents.
datetimes can be epoch time, the string 'now', or an date formatted in the
ISO format. timedelta (durations) are parsed from a similar format to the
GNU 'sleep' command, e.g. 1w2d8h5m20s -> 1 week, 2 days, 8 hours, 5 minutes,
20 seconds
As an example, to query reddit comments I've made in the last month
hpi query --order-type datetime --before now --within 4w my.reddit.all.comments
or...
hpi query --recent 4w my.reddit.all.comments
Can also query within a range. To filter comments between 2016 and 2018:
hpi query --order-type datetime --after '2016-01-01' --before '2019-01-01' my.reddit.all.comments
Options:
-o, --output [json|pprint|repl|gpx]
what to do with the result [default: json]
-s, --stream stream objects from the data source instead
of printing a list at the end
-k, --order-key TEXT order by an object attribute or dict key on
the individual objects returned by the HPI
function
-t, --order-type [datetime|date|int|float]
order by searching for some type on the
iterable
-a, --after TEXT while ordering, filter items for the key or
type larger than or equal to this
-b, --before TEXT while ordering, filter items for the key or
type smaller than this
-w, --within TEXT a range 'after' or 'before' to filter items
by. see above for further explanation
-r, --recent TEXT a shorthand for '--order-type datetime
--reverse --before now --within'. e.g.
--recent 5d
--reverse / --no-reverse reverse the results returned from the
functions
-l, --limit INTEGER limit the number of items returned from the
(functions)
--drop-unsorted if the order of an item can't be determined
while ordering, drop those items from the
results
--wrap-unsorted if the order of an item can't be determined
while ordering, wrap them into an
'Unsortable' object
--warn-exceptions if any errors are returned, print them as
errors on STDERR
--raise-exceptions if any errors are returned (as objects, not
raised) from the functions, raise them
--drop-exceptions ignore any errors returned as objects from
the functions
--help Show this message and exit.
```
This works with any function which returns an iterable, for example `my.coding.commits`, which searches for `git commit`s on your computer:
```bash
hpi query my.coding.commits
```
When run with a module, this does some analysis of the functions in that module and tries to find ones that look like data sources. If it can't figure out which, it prompts you like:
```
Which function should be used from 'my.coding.commits'?
1. commits
2. repos
```
You select the one you want by pressing `1` or `2` on your keyboard. Otherwise, you can provide a fully qualified path, like:
```
hpi query my.coding.commits.repos
```
The corresponding `repos` function this queries is defined in [`my/coding/commits.py`](../my/coding/commits.py)
### Ordering/Filtering/Streaming
By default, this just returns the items in the order they were returned by the function. This allows you to sort and filter by specifying a `--order-key` or `--order-type` -- for example, to get the most recent commits. `--order-type datetime` will try to automatically figure out which attribute to use. If it chooses the wrong one (since `Commit`s have both a `committed_dt` and `authored_dt`), you can tell it which to use with `--order-key`. For example, to scan my computer and find the most recent commit I made:
```
hpi query my.coding.commits.commits --order-key committed_dt --limit 1 --reverse --output pprint --stream
Commit(committed_dt=datetime.datetime(2023, 4, 14, 23, 9, 1, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))),
authored_dt=datetime.datetime(2023, 4, 14, 23, 4, 1, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))),
message='sources.smscalls: propagate errors if there are breaking '
'schema changes',
repo='/home/username/Repos/promnesia-fork',
sha='22a434fca9a28df9b0915ccf16368df129d2c9ce',
ref='refs/heads/smscalls-handle-result')
```
To instead limit in some range, you can use `--before` and `--within` to filter by a range. For example, to get all the commits I committed in the last day:
```
hpi query my.coding.commits.commits --order-type datetime --before now --within 1d
```
That prints a list of `Commit`s as JSON objects. You could also use `--output pprint` to pretty-print the objects or `--output repl` to drop into a REPL.
To process the JSON, you can pipe it to [`jq`](https://github.com/stedolan/jq). I often use `jq length` to get the count of some output:
```
hpi query my.coding.commits.commits --order-type datetime --before now --within 1d | jq length
6
```
Because grabbing data `--before now` is such a common use case, the `--recent` flag is a shorthand for `--order-type datetime --reverse --before now --within`. The same as above, to get the commits from the last day:
```
hpi query my.coding.commits.commits --recent 1d | jq length
6
```
To select a range of commits, you can use `--after` and `--before`, passing ISO or epoch timestamps. Those can be full `datetimes` (`2021-01-01T00:05:30`) or just dates (`2021-01-01`). For example, to get all the commits I made on January 1st, 2021:
```
hpi query my.coding.commits.commits --order-type datetime --after 2021-01-01 --before 2021-01-02 | jq length
1
```
If you have [`dateparser`](https://github.com/scrapinghub/dateparser#how-to-use) installed, this supports dozens more natural language formats:
```
hpi query my.coding.commits.commits --order-type datetime --after 'last week' --before 'day before yesterday' | jq length
28
```
If you're having issues ordering because there are exceptions in your results or not all data is sortable (some items may have `None` for some attributes), you can use `--drop-unsorted` to drop those items from the results, or `--drop-exceptions` to remove the exceptions
You can also stream the results, which is useful for functions that take a while to process or have a lot of data. For example, if you wanted to pick a sha hash from a particular repo, you could use `jq`'s `select` to filter by repo and pick that attribute from the JSON:
```
hpi query my.coding.commits.commits --recent 30d --stream | jq 'select(.repo | contains("HPI"))' | jq '.sha' -r
4afa899c8b365b3c10e468f6279c02e316d3b650
40de162fab741df594b4d9651348ee46ee021e9b
e1cb229913482074dc5523e57ef0acf6e9ec2bb2
87c13defd131e39292b93dcea661d3191222dace
02c738594f2cae36ca4fab43cf9533fe6aa89396
0b3a2a6ef3a9e4992771aaea0252fb28217b814a
84817ce72d208038b66f634d4ceb6e3a4c7ec5e9
47992b8e046d27fc5141839179f06f925c159510
425615614bd508e28ccceb56f43c692240e429ab
eed8f949460d768fb1f1c4801e9abab58a5f9021
d26ad7d9ce6a4718f96346b994c3c1cd0d74380c
aec517e53c6ac022f2b4cc91261daab5651cebf0
44b75a88fdfc7af132f61905232877031ce32fcb
b0ff6f29dd2846e97f8aa85a2ca73736b03254a8
```
`jq`'s `select` function acts on a stream of JSON objects, not a list, so it filters the output of `hpi query` as the objects are generated (the goal here is to conserve memory, since items which aren't needed are filtered out). The alternative would be to print the entire JSON list at the end, like:
`hpi query my.coding.commits.commits --recent 30d | jq '.[] | select(.repo | contains("Repos/HPI"))' | jq '.sha' -r`, using `jq '.[]'` to convert the JSON list into a stream of JSON objects.
## Usage on non-HPI code
The command can accept any qualified function name, so this could for example be used to check the output of [`promnesia`](https://github.com/karlicoss/promnesia) sources:
```
hpi query promnesia.sources.smscalls | jq length
371
```
This can be used on any function that produces an `Iterator`/`Generator` like output, as long as it can be called with no arguments.
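For instance, a tiny hypothetical module (say, `my_events.py` somewhere on your `PYTHONPATH`) is enough to be queryable:
```python
# my_events.py -- hypothetical module, not part of HPI
from datetime import datetime, timedelta
from typing import Iterator, NamedTuple


class Event(NamedTuple):
    dt: datetime
    name: str


def events() -> Iterator[Event]:
    # yield a few fake events with increasing timestamps
    start = datetime(2023, 1, 1)
    for i in range(3):
        yield Event(dt=start + timedelta(days=i), name=f'event {i}')
```
Then something like `hpi query --order-key dt my_events.events` should work the same way as it does for HPI's own modules.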
## GPX
The `hpi query` command can also be used with the `--output gpx` flag to generate gpx files from a list of locations, like the ones defined in the `my.location` package. This could be used to extract some date range and create a `gpx` file which can then be visualized by a GUI application.
This prints the contents for the `gpx` file to STDOUT, and prints warnings for any objects it could not convert to locations to STDERR, so pipe STDOUT to an output file, like `>out.gpx`
```
hpi query my.location.all --after '2021-07-01T00:00:00' --before '2021-07-05T00:00:00' --order-type datetime --output gpx >out.gpx
```
If you want to ignore any errors, you can use `--drop-exceptions`.
To preview, you can use something like [`qgis`](https://qgis.org/en/site/) or, for something easier and more lightweight, [`gpxsee`](https://github.com/tumic0/GPXSee):
`gpxsee out.gpx`:
<img src="https://user-images.githubusercontent.com/7804791/232249184-7e203ee6-a3ec-4053-800c-751d2c28e690.png" width=500 alt="chicago trip" />
(Sidenote: these are [`@purarue`](https://github.com/purarue/)'s locations, on a trip to Chicago)
## Python reference
The `hpi query` command is a CLI wrapper around the code in [`query.py`](../my/core/query.py) and [`query_range.py`](../my/core/query_range.py). The `select` function is the core of this, and `select_range` lets you specify dates, timedelta, start-end ranges, and other CLI-specific code.
`my.core.query.select`:
```
A function to query, order, sort and filter items from one or more sources
This supports iterables and lists of mixed types (including handling errors),
by allowing you to provide custom predicates (functions) which can sort
by a function, an attribute, dict key, or by the attributes values.
Since this supports mixed types, there's always a possibility
of KeyErrors or AttributeErrors while trying to find some value to order by,
so this provides multiple mechanisms to deal with that
'where' lets you filter items before ordering, to remove possible errors
or filter the iterator by some condition
There are multiple ways to instruct select on how to order items. The most
flexible is to provide an 'order_by' function, which takes an item in the
iterator, does any custom checks you may want and then returns the value to sort by
'order_key' is best used on items which have a similar structure, or have
the same attribute name for every item in the iterator. If you have an
iterator of objects whose datetime is accessed by the 'timestamp' attribute,
supplying order_key='timestamp' would sort by that (dictionary or attribute) key
'order_value' is the most confusing, but often the most useful. Instead of
testing against the keys of an item, this allows you to write a predicate
(function) to test against its values (dictionary, NamedTuple, dataclass, object).
If you had an iterator of mixed types and wanted to sort by the datetime,
but the attribute to access the datetime is different on each type, you can
provide `order_value=lambda v: isinstance(v, datetime)`, and this will
try to find that value for each type in the iterator, to sort it by
the value which is received when the predicate is true
'order_value' is often used in the 'hpi query' interface, because of its brevity.
Just given the input function, this can typically sort it by timestamp with
no human intervention. It can sort of be thought of as an educated guess,
but it can always be improved by providing a more complete guess function
Note that 'order_value' is also the most computationally expensive, as it has
to copy the iterator in memory (using itertools.tee) to determine how to order it
in memory
The 'drop_exceptions', 'raise_exceptions', 'warn_exceptions' let you ignore or raise
when the src contains exceptions. The 'warn_func' lets you provide a custom function
to call when an exception is encountered instead of using the 'warnings' module
src: an iterable of mixed types, or a function to be called,
as the input to this function
where: a predicate which filters the results before sorting
order_by: a function which when given an item in the src,
returns the value to sort by. Similar to the 'key' value
typically passed directly to 'sorted'
order_key: a string which represents a dict key or attribute name
to use as the key to sort by
order_value: predicate which determines which attribute on an ADT-like item to sort by,
when given its value. lambda o: isinstance(o, datetime) is commonly passed to sort
by datetime, without knowing the attributes or interface for the items in the src
default: while ordering, if the order for an object cannot be determined,
use this as the default value
reverse: reverse the order of the resulting iterable
limit: limit the results to this many items
drop_unsorted: before ordering, drop any items from the iterable for which an
order could not be determined. False by default
wrap_unsorted: before ordering, wrap any items into an 'Unsortable' object. Place
them at the front of the list. True by default
drop_exceptions: ignore any exceptions from the src
raise_exceptions: raise exceptions when received from the input src
```
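To give a feel for the Python interface, here is a minimal sketch of calling `select` directly (parameter names are taken from the docstring above; the data is made up):
```python
from datetime import datetime, timedelta
from typing import NamedTuple

from my.core.query import select


class Item(NamedTuple):
    timestamp: datetime
    name: str


# five fake items, one day apart
items = [Item(timestamp=datetime(2021, 1, 1) + timedelta(days=i), name=f'item {i}') for i in range(5)]

# order by the 'timestamp' attribute, newest first, and keep the two most recent
latest = list(select(items, order_key='timestamp', reverse=True, limit=2))
```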
`my.core.query_range.select_range`:
```
A specialized select function which offers generating functions
to filter/query ranges from an iterable
order_key and order_value are used in the same way they are in select
If you specify order_by_value_type, it tries to search for an attribute
on each object/type which has that type, ordering the iterable by that value
unparsed_range is a tuple of length 3, specifying 'after', 'before', 'duration',
i.e. some start point to allow the computed value we're ordering by, some
end point and a duration (can use the RangeTuple NamedTuple to construct one)
(this is typically parsed/created in my.core.__main__, from CLI flags)
If you specify a range, drop_unsorted is forced to be True
```
Those can be imported and accept any sort of iterator, `hpi query` just defaults to the output of functions here. As an example, see [`listens`](https://github.com/purarue/HPI-personal/blob/master/scripts/listens), which just passes a generator (iterator) as the first argument to `query_range`

View file

@ -105,10 +105,11 @@ You can also install some optional packages
They aren't necessary, but will improve your experience. At the moment these are:
- [[https://github.com/karlicoss/cachew][cachew]]: automatic caching library, which can greatly speedup data access
- [[https://github.com/metachris/logzero][logzero]]: a nice logging library, supporting colors
- [[https://github.com/ijl/orjson][orjson]]: a library for serializing data to JSON, used in ~my.core.serialize~ and the ~hpi query~ interface
- [[https://github.com/karlicoss/cachew][cachew]]: automatic caching library, which can greatly speedup data access
- [[https://github.com/python/mypy][mypy]]: mypy is used for checking configs and troubleshooting
- [[https://github.com/borntyping/python-colorlog][colorlog]]: colored formatter for ~logging~ module
- [[https://github.com/Rockhopper-Technologies/enlighten]]: console progress bar library
* Setting up modules
This is an *optional step* as few modules work without extra setup.
@ -191,7 +192,13 @@ HPI comes with a command line tool that can help you detect potential issues. Ru
If you only have a few modules set up, lots of them will error for you, which is expected, so check the ones you expect to work.
If you're having issues with ~cachew~ or want to show logs to troubleshoot what may be happening, you can pass the debug flag (e.g., ~hpi --debug doctor my.module_name~) or set the ~HPI_LOGS~ environment variable (e.g., ~HPI_LOGS=debug hpi query my.module_name~) to print all logs, including the ~cachew~ dependencies. ~HPI_LOGS~ could also be used to silence ~info~ logs, like ~HPI_LOGS=warning hpi ...~
If you're having issues with ~cachew~ or want to show logs to troubleshoot what may be happening, you can pass the debug flag (e.g., ~hpi --debug doctor my.module_name~) or set the ~LOGGING_LEVEL_HPI~ environment variable (e.g., ~LOGGING_LEVEL_HPI=debug hpi query my.module_name~) to print all logs, including the ~cachew~ dependencies. ~LOGGING_LEVEL_HPI~ could also be used to silence ~info~ logs, like ~LOGGING_LEVEL_HPI=warning hpi ...~
If you want to enable logs for a particular module, you can use the
~LOGGING_LEVEL_~ prefix and then the module name with underscores, like
~LOGGING_LEVEL_my_hypothesis=debug hpi query my.hypothesis~
If you want ~HPI~ to autocomplete the module names for you, this comes with shell completion, see [[../misc/completion/][misc/completion]]
If you have any ideas on how to improve it, please let me know!
@ -380,7 +387,7 @@ But there is an extra caveat: rexport is already coming with nice [[https://gith
Several other HPI modules are following a similar pattern: hypothesis, instapaper, pinboard, kobo, etc.
Since the [[https://github.com/karlicoss/rexport#api-limitations][reddit API has limited results]], you can use [[https://github.com/seanbreckenridge/pushshift_comment_export][my.reddit.pushshift]] to access older reddit comments, which both then get merged into =my.reddit.all.comments=
Since the [[https://github.com/karlicoss/rexport#api-limitations][reddit API has limited results]], you can use [[https://github.com/purarue/pushshift_comment_export][my.reddit.pushshift]] to access older reddit comments, which both then get merged into =my.reddit.all.comments=
** Twitter
@ -450,7 +457,7 @@ connect the data with other apps and libraries!
See more in [[file:../README.org::#how-do-you-use-it]["How do you use it?"]] section.
Also check out [[https://beepb00p.xyz/myinfra.html#hpi][my personal infrastructure map]] to see wher I'm using HPI.
Also check out [[https://beepb00p.xyz/myinfra.html#hpi][my personal infrastructure map]] to see where I'm using HPI.
* Adding/modifying modules
# TODO link to 'overlays' documentation?

View file

@ -0,0 +1,4 @@
#!/bin/bash
set -eux
pip3 install --user "$@" -e main/
pip3 install --user "$@" -e overlay/

View file

@ -0,0 +1,17 @@
from setuptools import setup, find_namespace_packages # type: ignore
def main() -> None:
    pkgs = find_namespace_packages('src')
    pkg = min(pkgs)
    setup(
        name='hpi-main',
        zip_safe=False,
        packages=pkgs,
        package_dir={'': 'src'},
        package_data={pkg: ['py.typed']},
    )

if __name__ == '__main__':
    main()

View file

@ -0,0 +1,11 @@
print(f'[main] {__name__} hello')
def upvotes() -> list[str]:
    return [
        'reddit upvote1',
        'reddit upvote2',
    ]

trigger_mypy_error: str = 123

View file

@ -0,0 +1,7 @@
print(f'[main] {__name__} hello')
from .common import merge
def tweets() -> list[str]:
    from . import gdpr
    return merge(gdpr)

View file

@ -0,0 +1,11 @@
print(f'[main] {__name__} hello')
from typing import Protocol
class Source(Protocol):
    def tweets(self) -> list[str]:
        ...

def merge(*sources: Source) -> list[str]:
    from itertools import chain
    return list(chain.from_iterable(src.tweets() for src in sources))

View file

@ -0,0 +1,9 @@
print(f'[main] {__name__} hello')
def tweets() -> list[str]:
    return [
        'gdpr tweet 1',
        'gdpr tweet 2',
    ]

trigger_mypy_error: str = 123

View file

@ -0,0 +1,17 @@
from setuptools import setup, find_namespace_packages # type: ignore
def main() -> None:
    pkgs = find_namespace_packages('src')
    pkg = min(pkgs)
    setup(
        name='hpi-overlay',
        zip_safe=False,
        packages=pkgs,
        package_dir={'': 'src'},
        package_data={pkg: ['py.typed']},
    )

if __name__ == '__main__':
    main()

View file

@ -0,0 +1,8 @@
print(f'[overlay] {__name__} hello')
from .common import merge
def tweets() -> list[str]:
    from . import gdpr
    from . import talon
    return merge(gdpr, talon)

View file

@ -0,0 +1,9 @@
print(f'[overlay] {__name__} hello')
def tweets() -> list[str]:
    return [
        'talon tweet 1',
        'talon tweet 2',
    ]

trigger_mypy_error: str = 123

View file

@ -0,0 +1,17 @@
from setuptools import setup, find_namespace_packages # type: ignore
def main() -> None:
    pkgs = find_namespace_packages('src')
    pkg = min(pkgs)
    setup(
        name='hpi-overlay2',
        zip_safe=False,
        packages=pkgs,
        package_dir={'': 'src'},
        package_data={pkg: ['py.typed']},
    )

if __name__ == '__main__':
    main()

View file

@ -0,0 +1,13 @@
print(f'[overlay2] {__name__} hello')
from pkgutil import extend_path
__path__ = extend_path(__path__, __name__)
def hack_gdpr_module() -> None:
    from . import gdpr
    tweets_orig = gdpr.tweets
    def tweets_patched():
        return [t.replace('gdpr', 'GDPR') for t in tweets_orig()]
    gdpr.tweets = tweets_patched

hack_gdpr_module()

View file

@ -0,0 +1,17 @@
from setuptools import setup, find_namespace_packages # type: ignore
def main() -> None:
    pkgs = find_namespace_packages('src')
    pkg = min(pkgs)
    setup(
        name='hpi-overlay3',
        zip_safe=False,
        packages=pkgs,
        package_dir={'': 'src'},
        package_data={pkg: ['py.typed']},
    )

if __name__ == '__main__':
    main()

View file

@ -0,0 +1,9 @@
import importhook
@importhook.on_import('my.twitter.gdpr')
def on_import(gdpr):
    print("EXECUTING IMPORT HOOK!")
    tweets_orig = gdpr.tweets
    def tweets_patched():
        return [t.replace('gdpr', 'GDPR') for t in tweets_orig()]
    gdpr.tweets = tweets_patched

View file

@ -32,6 +32,6 @@ ignore =
#
# as a reference:
# https://github.com/seanbreckenridge/cookiecutter-template/blob/master/%7B%7Bcookiecutter.module_name%7D%7D/setup.cfg
# https://github.com/purarue/cookiecutter-template/blob/master/%7B%7Bcookiecutter.module_name%7D%7D/setup.cfg
# and this https://github.com/karlicoss/HPI/pull/151
# find ./my | entr flake8 --ignore=E402,E501,E741,W503,E266,E302,E305,E203,E261,E252,E251,E221,W291,E225,E303,E702,E202,F841,E731,E306,E127 E722,E231 my | grep -v __NOT_HPI_MODULE__

View file

@ -46,7 +46,7 @@ check '2016-12-13 Tue 20:23.*TIL:.*pypi.python.org/pypi/coloredlogs'
# https://twitter.com/karlicoss/status/472151454044917761
# archive isn't explaning images by default
# archive isn't expanding images by default
check '2014-05-29 Thu 23:04.*Выколол сингулярность.*pic.twitter.com/M6XRN1n7KW'
@ -76,7 +76,7 @@ check '2014-12-31 Wed 21:00.*2015 заебал'
check '2021-05-14 Fri 21:08.*RT @SNunoPerez: Me explaining Rage.*'
# make sure there is a single occurence (hence, correct tzs)
# make sure there is a single occurrence (hence, correct tzs)
check 'A short esoteric Python'
# https://twitter.com/karlicoss/status/1499174823272099842
check 'It would be a really good time for countries'

View file

@ -10,21 +10,23 @@ eval "$(_HPI_COMPLETE=fish_source hpi)" # in ~/.config/fish/config.fish
That is slightly slower since it's generating the completion code on the fly -- see [click docs](https://click.palletsprojects.com/en/8.0.x/shell-completion/#enabling-completion) for more info
To use the completions here:
To use the generated completion files in this repository, you need to source the file in `./bash`, `./zsh`, or `./fish` depending on your shell.
If you don't have HPI cloned locally, after installing `HPI` you can generate the file yourself using one of the commands above. For example, for `bash`: `_HPI_COMPLETE=bash_source hpi > ~/.config/hpi_bash_completion`, and then source it like `source ~/.config/hpi_bash_completion`
### bash
Put `source /path/to/bash/_hpi` in your `~/.bashrc`
Put `source /path/to/hpi/repo/misc/completion/bash/_hpi` in your `~/.bashrc`
### zsh
You can either source the file:
`source /path/to/zsh/_hpi`
`source /path/to/hpi/repo/misc/completion/zsh/_hpi`
..or add the directory to your `fpath` to load it lazily:
`fpath=("/path/to/zsh/" "${fpath[@]}")` (Note: the directory, not the script `_hpi`)
`fpath=("/path/to/hpi/repo/misc/completion/zsh/" "${fpath[@]}")` (Note: the directory, not the script `_hpi`)
If your zsh configuration doesn't automatically run `compinit`, after modifying your `fpath` you should:

View file

@ -1,9 +1,5 @@
function _hpi_completion;
set -l response;
for value in (env _HPI_COMPLETE=fish_complete COMP_WORDS=(commandline -cp) COMP_CWORD=(commandline -t) hpi);
set response $response $value;
end;
set -l response (env _HPI_COMPLETE=fish_complete COMP_WORDS=(commandline -cp) COMP_CWORD=(commandline -t) hpi);
for completion in $response;
set -l metadata (string split "," $completion);

View file

@ -31,5 +31,11 @@ _hpi_completion() {
fi
}
compdef _hpi_completion hpi;
if [[ $zsh_eval_context[-1] == loadautofunc ]]; then
# autoload from fpath, call function directly
_hpi_completion "$@"
else
# eval/source/. command, register function for later
compdef _hpi_completion hpi
fi

View file

@ -2,19 +2,22 @@
[[https://github.com/nomeata/arbtt#arbtt-the-automatic-rule-based-time-tracker][Arbtt]] time tracking
'''
from __future__ import annotations
REQUIRES = ['ijson', 'cffi']
# NOTE likely also needs libyajl2 from apt or elsewhere?
from collections.abc import Iterable, Sequence
from dataclasses import dataclass
from pathlib import Path
from typing import Sequence, Iterable, List, Optional
def inputs() -> Sequence[Path]:
try:
from my.config import arbtt as user_config
except ImportError:
from .core.warnings import low
from my.core.warnings import low
low("Couldn't find 'arbtt' config section, falling back to the default capture.log (usually in HOME dir). Add 'arbtt' section with logfiles = '' to suppress this warning.")
return []
else:
@ -22,8 +25,9 @@ def inputs() -> Sequence[Path]:
return get_files(user_config.logfiles)
from .core import dataclass, Json, PathIsh, datetime_aware
from .core.common import isoparse
from my.core import Json, PathIsh, datetime_aware
from my.core.compat import fromisoformat
@dataclass
@ -39,6 +43,7 @@ class Entry:
@property
def dt(self) -> datetime_aware:
# contains utc already
# TODO after python>=3.11, could just use fromisoformat
ds = self.json['date']
elen = 27
lds = len(ds)
@ -46,13 +51,13 @@ class Entry:
# ugh. sometimes contains less that 6 decimal points
ds = ds[:-1] + '0' * (elen - lds) + 'Z'
elif lds > elen:
# ahd sometimes more...
# and sometimes more...
ds = ds[:elen - 1] + 'Z'
return isoparse(ds)
return fromisoformat(ds)
@property
def active(self) -> Optional[str]:
def active(self) -> str | None:
# NOTE: WIP, might change this in the future...
ait = (w for w in self.json['windows'] if w['active'])
a = next(ait, None)
@ -71,17 +76,18 @@ class Entry:
def entries() -> Iterable[Entry]:
inps = list(inputs())
base: List[PathIsh] = ['arbtt-dump', '--format=json']
base: list[PathIsh] = ['arbtt-dump', '--format=json']
cmds: List[List[PathIsh]]
cmds: list[list[PathIsh]]
if len(inps) == 0:
cmds = [base] # rely on default
else:
# otherise, 'merge' them
cmds = [base + ['--logfile', f] for f in inps]
# otherwise, 'merge' them
cmds = [[*base, '--logfile', f] for f in inps]
from subprocess import PIPE, Popen
import ijson.backends.yajl2_cffi as ijson # type: ignore
from subprocess import Popen, PIPE
for cmd in cmds:
with Popen(cmd, stdout=PIPE) as p:
out = p.stdout; assert out is not None
@ -90,8 +96,8 @@ def entries() -> Iterable[Entry]:
def fill_influxdb() -> None:
from .core.influxdb import magic_fill
from .core.freezer import Freezer
from .core.influxdb import magic_fill
freezer = Freezer(Entry)
fit = (freezer.freeze(e) for e in entries())
# TODO crap, influxdb doesn't like None https://github.com/influxdata/influxdb/issues/7722
@ -103,6 +109,8 @@ def fill_influxdb() -> None:
magic_fill(fit, name=f'{entries.__module__}:{entries.__name__}')
from .core import stat, Stats
from .core import Stats, stat
def stats() -> Stats:
return stat(entries)

View file

@ -1,34 +1,70 @@
#!/usr/bin/python3
"""
[[https://bluemaestro.com/products/product-details/bluetooth-environmental-monitor-and-logger][Bluemaestro]] temperature/humidity/pressure monitor
"""
from __future__ import annotations
# todo most of it belongs to DAL... but considering so few people use it I didn't bother for now
from datetime import datetime, timedelta
from pathlib import Path
import re
import sqlite3
from typing import Iterable, Sequence, Set, Optional
from abc import abstractmethod
from collections.abc import Iterable, Sequence
from dataclasses import dataclass
from datetime import datetime, timedelta
from pathlib import Path
from typing import Protocol
from my.core import get_files, LazyLogger, dataclass, Res
import pytz
from my.core import (
Paths,
Res,
Stats,
get_files,
make_logger,
stat,
unwrap,
)
from my.core.cachew import mcachew
from my.core.pandas import DataFrameT, as_dataframe
from my.core.sqlite import sqlite_connect_immutable
from my.config import bluemaestro as config
class config(Protocol):
@property
@abstractmethod
def export_path(self) -> Paths:
raise NotImplementedError
@property
def tz(self) -> pytz.BaseTzInfo:
# fixme: later, rely on the timezone provider
# NOTE: the timezone should be set with respect to the export date!!!
return pytz.timezone('Europe/London')
# TODO when I change tz, check the diff
# todo control level via env variable?
# i.e. HPI_LOGGING_MY_BLUEMAESTRO_LEVEL=debug
logger = LazyLogger(__name__, level='debug')
def make_config() -> config:
from my.config import bluemaestro as user_config
class combined_config(user_config, config): ...
return combined_config()
logger = make_logger(__name__)
def inputs() -> Sequence[Path]:
return get_files(config.export_path)
cfg = make_config()
return get_files(cfg.export_path)
Celsius = float
Percent = float
mBar = float
@dataclass
class Measurement:
dt: datetime # todo aware/naive
@ -38,41 +74,39 @@ class Measurement:
dewpoint: Celsius
# fixme: later, rely on the timezone provider
# NOTE: the timezone should be set with respect to the export date!!!
import pytz # type: ignore
tz = pytz.timezone('Europe/London')
# TODO when I change tz, check the diff
def is_bad_table(name: str) -> bool:
# todo hmm would be nice to have a hook that can patch any module up to
delegate = getattr(config, 'is_bad_table', None)
return False if delegate is None else delegate(name)
from my.core.cachew import cache_dir
from my.core.common import mcachew
@mcachew(depends_on=lambda: inputs(), cache_path=cache_dir('bluemaestro'))
@mcachew(depends_on=inputs)
def measurements() -> Iterable[Res[Measurement]]:
# todo ideally this would be via arguments... but needs to be lazy
dbs = inputs()
cfg = make_config()
tz = cfg.tz
last: Optional[datetime] = None
# todo ideally this would be via arguments... but needs to be lazy
paths = inputs()
total = len(paths)
width = len(str(total))
last: datetime | None = None
# tables are immutable, so can save on processing..
processed_tables: Set[str] = set()
for f in dbs:
logger.debug('processing %s', f)
processed_tables: set[str] = set()
for idx, path in enumerate(paths):
logger.info(f'processing [{idx:>{width}}/{total:>{width}}] {path}')
tot = 0
new = 0
# todo assert increasing timestamp?
with sqlite_connect_immutable(f) as db:
db_dt: Optional[datetime] = None
with sqlite_connect_immutable(path) as db:
db_dt: datetime | None = None
try:
datas = db.execute(f'SELECT "{f.name}" as name, Time, Temperature, Humidity, Pressure, Dewpoint FROM data ORDER BY log_index')
datas = db.execute(
f'SELECT "{path.name}" as name, Time, Temperature, Humidity, Pressure, Dewpoint FROM data ORDER BY log_index'
)
oldfmt = True
db_dts = list(db.execute('SELECT last_download FROM info'))[0][0]
[(db_dts,)] = db.execute('SELECT last_download FROM info')
if db_dts == 'N/A':
# ??? happens for 20180923-20180928
continue
@ -105,7 +139,7 @@ def measurements() -> Iterable[Res[Measurement]]:
processed_tables |= set(log_tables)
# todo use later?
frequencies = [list(db.execute(f'SELECT interval from {t.replace("_log", "_meta")}'))[0][0] for t in log_tables]
frequencies = [list(db.execute(f'SELECT interval from {t.replace("_log", "_meta")}'))[0][0] for t in log_tables] # noqa: RUF015
# todo could just filter out the older datapoints?? dunno.
@ -121,7 +155,7 @@ def measurements() -> Iterable[Res[Measurement]]:
oldfmt = False
db_dt = None
for i, (name, tsc, temp, hum, pres, dewp) in enumerate(datas):
for (name, tsc, temp, hum, pres, dewp) in datas:
if is_bad_table(name):
continue
@ -145,7 +179,7 @@ def measurements() -> Iterable[Res[Measurement]]:
upper = timedelta(days=10) # kinda arbitrary
if not (db_dt - lower < dt < db_dt + timedelta(days=10)):
# todo could be more defensive??
yield RuntimeError('timestamp too far out', f, name, db_dt, dt)
yield RuntimeError('timestamp too far out', path, name, db_dt, dt)
continue
# err.. sometimes my values are just interleaved with these for no apparent reason???
@ -153,7 +187,7 @@ def measurements() -> Iterable[Res[Measurement]]:
yield RuntimeError('the weird sensor bug')
continue
assert -60 <= temp <= 60, (f, dt, temp)
assert -60 <= temp <= 60, (path, dt, temp)
##
tot += 1
@ -170,7 +204,7 @@ def measurements() -> Iterable[Res[Measurement]]:
dewpoint=dewp,
)
yield p
logger.debug('%s: new %d/%d', f, new, tot)
logger.debug(f'{path}: new {new}/{tot}')
# logger.info('total items: %d', len(merged))
# for k, v in merged.items():
# # TODO shit. quite a few of them have varying values... how is that freaking possible????
@ -180,12 +214,11 @@ def measurements() -> Iterable[Res[Measurement]]:
# for k, v in merged.items():
# yield Point(dt=k, temp=v) # meh?
from my.core import stat, Stats
def stats() -> Stats:
return stat(measurements)
from my.core.pandas import DataFrameT, as_dataframe
def dataframe() -> DataFrameT:
"""
%matplotlib gtk
@ -200,6 +233,7 @@ def dataframe() -> DataFrameT:
def fill_influxdb() -> None:
from my.core import influxdb
influxdb.fill(measurements(), measurement=__name__)
@ -207,7 +241,6 @@ def check() -> None:
temps = list(measurements())
latest = temps[:-2]
from my.core.error import unwrap
prev = unwrap(latest[-2]).dt
last = unwrap(latest[-1]).dt


@ -2,41 +2,42 @@
Blood tracking (manual org-mode entries)
"""
from __future__ import annotations
from collections.abc import Iterable
from datetime import datetime
from typing import Iterable, NamedTuple, Optional
from typing import NamedTuple
from ..core.error import Res
from ..core.orgmode import parse_org_datetime, one_table
import pandas as pd # type: ignore
import orgparse
import pandas as pd
from my.config import blood as config # type: ignore[attr-defined]
from ..core.error import Res
from ..core.orgmode import one_table, parse_org_datetime
class Entry(NamedTuple):
dt: datetime
ketones : Optional[float]=None
glucose : Optional[float]=None
ketones : float | None=None
glucose : float | None=None
vitamin_d : Optional[float]=None
vitamin_b12 : Optional[float]=None
vitamin_d : float | None=None
vitamin_b12 : float | None=None
hdl : Optional[float]=None
ldl : Optional[float]=None
triglycerides: Optional[float]=None
hdl : float | None=None
ldl : float | None=None
triglycerides: float | None=None
source : Optional[str]=None
extra : Optional[str]=None
source : str | None=None
extra : str | None=None
Result = Res[Entry]
def try_float(s: str) -> Optional[float]:
def try_float(s: str) -> float | None:
l = s.split()
if len(l) == 0:
return None
@ -105,6 +106,7 @@ def blood_tests_data() -> Iterable[Result]:
def data() -> Iterable[Result]:
from itertools import chain
from ..core.error import sort_res_by
datas = chain(glucose_ketones_data(), blood_tests_data())
return sort_res_by(datas, key=lambda e: e.dt)


@ -7,10 +7,10 @@ from ...core.pandas import DataFrameT, check_dataframe
@check_dataframe
def dataframe() -> DataFrameT:
# this should be somehow more flexible...
import pandas as pd
from ...endomondo import dataframe as EDF
from ...runnerup import dataframe as RDF
import pandas as pd # type: ignore
return pd.concat([
EDF(),
RDF(),


@ -3,7 +3,6 @@ Cardio data, filtered from various data sources
'''
from ...core.pandas import DataFrameT, check_dataframe
CARDIO = {
'Running',
'Running, treadmill',


@ -5,16 +5,18 @@ This is probably too specific to my needs, so later I will move it away to a per
For now it's worth keeping it here as an example and perhaps utility functions might be useful for other HPI modules.
'''
from datetime import datetime, timedelta
from typing import Optional
from __future__ import annotations
from ...core.pandas import DataFrameT, check_dataframe as cdf
from ...core.orgmode import collect, Table, parse_org_datetime, TypedTable
from datetime import datetime, timedelta
import pytz
from my.config import exercise as config
from ...core.orgmode import Table, TypedTable, collect, parse_org_datetime
from ...core.pandas import DataFrameT
from ...core.pandas import check_dataframe as cdf
import pytz
# FIXME how to attach it properly?
tz = pytz.timezone('Europe/London')
@ -78,7 +80,7 @@ def cross_trainer_manual_dataframe() -> DataFrameT:
'''
Only manual org-mode entries
'''
import pandas as pd # type: ignore[import]
import pandas as pd
df = pd.DataFrame(cross_trainer_data())
return df
@ -91,7 +93,7 @@ def dataframe() -> DataFrameT:
'''
Attaches manually logged data (which Endomondo can't capture) to the Endomondo data
'''
import pandas as pd # type: ignore[import]
import pandas as pd
from ...endomondo import dataframe as EDF
edf = EDF()
@ -105,7 +107,7 @@ def dataframe() -> DataFrameT:
rows = []
idxs = [] # type: ignore[var-annotated]
NO_ENDOMONDO = 'no endomondo matches'
for i, row in mdf.iterrows():
for _i, row in mdf.iterrows():
rd = row.to_dict()
mdate = row['date']
if pd.isna(mdate):
@ -114,7 +116,7 @@ def dataframe() -> DataFrameT:
rows.append(rd) # presumably has an error set
continue
idx: Optional[int]
idx: int | None
close = edf[edf['start_time'].apply(lambda t: pd_date_diff(t, mdate)).abs() < _DELTA]
if len(close) == 0:
idx = None
@ -146,7 +148,7 @@ def dataframe() -> DataFrameT:
# todo careful about 'how'? we need it to preserve the errors
# maybe pd.merge is better suited for this??
df = edf.join(mdf, how='outer', rsuffix='_manual')
# todo reindex? so we dont' have Nan leftovers
# todo reindex? so we don't have Nan leftovers
# todo set date anyway? maybe just squeeze into the index??
noendo = df['error'] == NO_ENDOMONDO
@ -163,7 +165,9 @@ def dataframe() -> DataFrameT:
# TODO wtf?? where is speed coming from??
from ...core import stat, Stats
from ...core import Stats, stat
def stats() -> Stats:
return stat(cross_trainer_data)


@ -1,5 +1,6 @@
from ...core import stat, Stats
from ...core.pandas import DataFrameT, check_dataframe as cdf
from ...core import Stats, stat
from ...core.pandas import DataFrameT
from ...core.pandas import check_dataframe as cdf
class Combine:
@ -7,8 +8,8 @@ class Combine:
self.modules = modules
@cdf
def dataframe(self, with_temperature: bool=True) -> DataFrameT:
import pandas as pd # type: ignore
def dataframe(self, *, with_temperature: bool=True) -> DataFrameT:
import pandas as pd
# todo include 'source'?
df = pd.concat([m.dataframe() for m in self.modules])
@ -17,15 +18,21 @@ class Combine:
bdf = BM.dataframe()
temp = bdf['temp']
# sort index and drop nans, otherwise indexing with [start: end] gonna complain
temp = pd.Series(
temp.values,
index=pd.to_datetime(temp.index, utc=True)
).sort_index()
temp = temp.loc[temp.index.dropna()]
def calc_avg_temperature(row):
start = row['sleep_start']
end = row['sleep_end']
if pd.isna(start) or pd.isna(end):
return None
between = (start <= temp.index) & (temp.index <= end)
# on no temp data, returns nan, ok
return temp[between].mean()
return temp[start: end].mean()
df['avg_temp'] = df.apply(calc_avg_temperature, axis=1)
return df


@ -1,7 +1,6 @@
from ... import jawbone
from ... import emfit
from ... import emfit, jawbone
from .common import Combine
_combined = Combine([
jawbone,
emfit,


@ -2,21 +2,29 @@
Weight data (manually logged)
'''
from collections.abc import Iterator
from dataclasses import dataclass
from datetime import datetime
from typing import NamedTuple, Iterator
from typing import Any
from ..core import LazyLogger
from ..core.error import Res, set_error_datetime, extract_error_datetime
from my import orgmode
from my.core import make_logger
from my.core.error import Res, extract_error_datetime, set_error_datetime
from .. import orgmode
from my.config import weight as config # type: ignore[attr-defined]
config = Any
log = LazyLogger('my.body.weight')
def make_config() -> config:
from my.config import weight as user_config # type: ignore[attr-defined]
return user_config()
class Entry(NamedTuple):
log = make_logger(__name__)
@dataclass
class Entry:
dt: datetime
value: float
# TODO comment??
@ -26,6 +34,8 @@ Result = Res[Entry]
def from_orgmode() -> Iterator[Result]:
cfg = make_config()
orgs = orgmode.query()
for o in orgmode.query().all():
if 'weight' not in o.tags:
@ -46,7 +56,7 @@ def from_orgmode() -> Iterator[Result]:
yield e
continue
# FIXME use timezone provider
created = config.default_timezone.localize(created)
created = cfg.default_timezone.localize(created)
assert created is not None # ??? somehow mypy wasn't happy?
yield Entry(
dt=created,
@ -56,7 +66,8 @@ def from_orgmode() -> Iterator[Result]:
def make_dataframe(data: Iterator[Result]):
import pandas as pd # type: ignore
import pandas as pd
def it():
for e in data:
if isinstance(e, Exception):
@ -70,8 +81,9 @@ def make_dataframe(data: Iterator[Result]):
'dt': e.dt,
'weight': e.value,
}
df = pd.DataFrame(it())
df.set_index('dt', inplace=True)
df = df.set_index('dt')
# TODO not sure about UTC??
df.index = pd.to_datetime(df.index, utc=True)
return df
@ -81,6 +93,7 @@ def dataframe():
entries = from_orgmode()
return make_dataframe(entries)
# TODO move to a submodule? e.g. my.body.weight.orgmode?
# so there could be more sources
# not sure about my.body thing though


@ -1,7 +1,6 @@
from ..core import warnings
from my.core import warnings
warnings.high('my.books.kobo is deprecated! Please use my.kobo instead!')
from ..core.util import __NOT_HPI_MODULE__
from ..kobo import * # type: ignore[no-redef]
from my.core.util import __NOT_HPI_MODULE__
from my.kobo import *


@ -1,12 +1,13 @@
"""
Parses active browser history by backing it up with [[http://github.com/seanbreckenridge/sqlite_backup][sqlite_backup]]
Parses active browser history by backing it up with [[http://github.com/purarue/sqlite_backup][sqlite_backup]]
"""
REQUIRES = ["browserexport", "sqlite_backup"]
from dataclasses import dataclass
from my.config import browser as user_config
from my.core import Paths, dataclass
from my.core import Paths
@dataclass
@ -18,16 +19,19 @@ class config(user_config.active_browser):
export_path: Paths
from collections.abc import Iterator, Sequence
from pathlib import Path
from typing import Sequence, Iterator
from my.core import get_files, Stats
from browserexport.merge import read_visits, Visit
from browserexport.merge import Visit, read_visits
from sqlite_backup import sqlite_backup
from my.core import Stats, get_files, make_logger
logger = make_logger(__name__)
from .common import _patch_browserexport_logs
_patch_browserexport_logs()
_patch_browserexport_logs(logger.level)
def inputs() -> Sequence[Path]:


@ -1,9 +1,9 @@
from typing import Iterator
from collections.abc import Iterator
from browserexport.merge import Visit, merge_visits
from my.core import Stats
from my.core.source import import_source
from browserexport.merge import merge_visits, Visit
src_export = import_source(module_name="my.browser.export")
src_active = import_source(module_name="my.browser.active_browser")


@ -1,11 +1,8 @@
import os
from my.core.util import __NOT_HPI_MODULE__
def _patch_browserexport_logs():
# patch browserexport logs if HPI_LOGS is present
if "HPI_LOGS" in os.environ:
def _patch_browserexport_logs(level: int):
# grab the computed level (respects LOGGING_LEVEL_ prefixes) and set it on the browserexport logger
from browserexport.log import setup as setup_browserexport_logger
from my.core.logging import mklevel
setup_browserexport_logger(mklevel(os.environ["HPI_LOGS"]))
setup_browserexport_logger(level)


@ -1,33 +1,37 @@
"""
Parses browser history using [[http://github.com/seanbreckenridge/browserexport][browserexport]]
Parses browser history using [[http://github.com/purarue/browserexport][browserexport]]
"""
REQUIRES = ["browserexport"]
from my.config import browser as user_config
from my.core import Paths, dataclass
from collections.abc import Iterator, Sequence
from dataclasses import dataclass
from pathlib import Path
from browserexport.merge import Visit, read_and_merge
from my.core import (
Paths,
Stats,
get_files,
make_logger,
stat,
)
from my.core.cachew import mcachew
from .common import _patch_browserexport_logs
import my.config # isort: skip
@dataclass
class config(user_config.export):
class config(my.config.browser.export):
# path[s]/glob to your backed up browser history sqlite files
export_path: Paths
from pathlib import Path
from typing import Iterator, Sequence, List
from my.core import Stats, get_files, LazyLogger
from my.core.common import mcachew
from browserexport.merge import read_and_merge, Visit
from .common import _patch_browserexport_logs
logger = LazyLogger(__name__, level="warning")
_patch_browserexport_logs()
logger = make_logger(__name__)
_patch_browserexport_logs(logger.level)
# all of my backed up databases
@ -35,16 +39,10 @@ def inputs() -> Sequence[Path]:
return get_files(config.export_path)
def _cachew_depends_on() -> List[str]:
return [str(f) for f in inputs()]
@mcachew(depends_on=_cachew_depends_on, logger=logger)
@mcachew(depends_on=inputs, logger=logger)
def history() -> Iterator[Visit]:
yield from read_and_merge(inputs())
def stats() -> Stats:
from my.core import stat
return {**stat(history)}


@ -3,24 +3,24 @@ Bumble data from Android app database (in =/data/data/com.bumble.app/databases/C
"""
from __future__ import annotations
from collections.abc import Iterator, Sequence
from dataclasses import dataclass
from datetime import datetime
from typing import Iterator, Sequence, Optional, Dict
from pathlib import Path
from more_itertools import unique_everseen
from my.config import bumble as user_config
from my.core import Paths, get_files
from my.config import bumble as user_config # isort: skip
from ..core import Paths
@dataclass
class config(user_config.android):
# paths[s]/glob to the exported sqlite databases
export_path: Paths
from ..core import get_files
from pathlib import Path
def inputs() -> Sequence[Path]:
return get_files(config.export_path)
@ -43,20 +43,23 @@ class _BaseMessage:
@dataclass(unsafe_hash=True)
class _Message(_BaseMessage):
conversation_id: str
reply_to_id: Optional[str]
reply_to_id: str | None
@dataclass(unsafe_hash=True)
class Message(_BaseMessage):
person: Person
reply_to: Optional[Message]
reply_to: Message | None
import json
from typing import Union
from ..core import Res, assert_never
import sqlite3
from ..core.sqlite import sqlite_connect_immutable, select
from typing import Union
from my.core.compat import assert_never
from ..core import Res
from ..core.sqlite import select, sqlite_connect_immutable
EntitiesRes = Res[Union[Person, _Message]]
@ -89,7 +92,7 @@ def _handle_db(db: sqlite3.Connection) -> Iterator[EntitiesRes]:
db=db
):
try:
key = {'TEXT': 'text', 'QUESTION_GAME': 'text', 'IMAGE': 'url', 'GIF': 'url'}[payload_type]
key = {'TEXT': 'text', 'QUESTION_GAME': 'text', 'IMAGE': 'url', 'GIF': 'url', 'AUDIO': 'url', 'VIDEO': 'url'}[payload_type]
text = json.loads(payload)[key]
yield _Message(
id=id,
@ -106,17 +109,21 @@ def _handle_db(db: sqlite3.Connection) -> Iterator[EntitiesRes]:
def _key(r: EntitiesRes):
if isinstance(r, _Message):
if '&srv_width=' in r.text:
if '/hidden?' in r.text:
# ugh. seems that image URLs change all the time in the db?
# can't access them without login anyway
# so use a different key for such messages
# todo maybe normalize text instead? since it's gonna always trigger diffs down the line
return (r.id, r.created)
return r
_UNKNOWN_PERSON = "UNKNOWN_PERSON"
def messages() -> Iterator[Res[Message]]:
id2person: Dict[str, Person] = {}
id2msg: Dict[str, Message] = {}
id2person: dict[str, Person] = {}
id2msg: dict[str, Message] = {}
for x in unique_everseen(_entities(), key=_key):
if isinstance(x, Exception):
yield x
@ -126,8 +133,12 @@ def messages() -> Iterator[Res[Message]]:
continue
if isinstance(x, _Message):
reply_to_id = x.reply_to_id
# hmm seems that sometimes there are messages with no corresponding conversation_info?
# possibly if user never clicked on conversation before..
person = id2person.get(x.conversation_id)
if person is None:
person = Person(user_id=x.conversation_id, user_name=_UNKNOWN_PERSON)
try:
person = id2person[x.conversation_id]
reply_to = None if reply_to_id is None else id2msg[reply_to_id]
except Exception as e:
yield e


@ -9,16 +9,18 @@ from datetime import date, datetime, timedelta
from functools import lru_cache
from typing import Union
from ..core.time import zone_to_countrycode
from my.core import Stats
from my.core.time import zone_to_countrycode
@lru_cache(1)
def _calendar():
from workalendar.registry import registry # type: ignore
# todo switch to using time.tz.main once _get_tz stabilizes?
from ..time.tz import via_location as LTZ
# TODO would be nice to do it dynamically depending on the past timezones...
tz = LTZ._get_tz(datetime.now())
tz = LTZ.get_tz(datetime.now())
assert tz is not None
zone = tz.zone; assert zone is not None
code = zone_to_countrycode(zone)
@ -46,7 +48,6 @@ def is_workday(d: DateIsh) -> bool:
return not is_holiday(d)
from ..core.common import Stats
def stats() -> Stats:
# meh, but not sure what would be a better test?
res = {}


@ -1,7 +1,6 @@
import my.config as config
from .core import __NOT_HPI_MODULE__
from .core import warnings as W
# still used in Promnesia, maybe in dashboard?

my/codeforces.py Normal file (+78 lines)

@ -0,0 +1,78 @@
import json
from collections.abc import Iterator, Sequence
from dataclasses import dataclass
from datetime import datetime, timezone
from functools import cached_property
from pathlib import Path
from my.config import codeforces as config # type: ignore[attr-defined]
from my.core import Res, datetime_aware, get_files
def inputs() -> Sequence[Path]:
return get_files(config.export_path)
ContestId = int
@dataclass
class Contest:
contest_id: ContestId
when: datetime_aware
name: str
@dataclass
class Competition:
contest: Contest
old_rating: int
new_rating: int
@cached_property
def when(self) -> datetime_aware:
return self.contest.when
# todo not sure if parser is the best name? hmm
class Parser:
def __init__(self, *, inputs: Sequence[Path]) -> None:
self.inputs = inputs
self.contests: dict[ContestId, Contest] = {}
def _parse_allcontests(self, p: Path) -> Iterator[Contest]:
j = json.loads(p.read_text())
for c in j['result']:
yield Contest(
contest_id=c['id'],
when=datetime.fromtimestamp(c['startTimeSeconds'], tz=timezone.utc),
name=c['name'],
)
def _parse_competitions(self, p: Path) -> Iterator[Competition]:
j = json.loads(p.read_text())
for c in j['result']:
contest_id = c['contestId']
contest = self.contests[contest_id]
yield Competition(
contest=contest,
old_rating=c['oldRating'],
new_rating=c['newRating'],
)
def parse(self) -> Iterator[Res[Competition]]:
for path in inputs():
if 'allcontests' in path.name:
# these contain information about all CF contests along with useful metadata
for contest in self._parse_allcontests(path):
# TODO some method to assert on mismatch if it exists? not sure
self.contests[contest.contest_id] = contest
elif 'codeforces' in path.name:
# these contain only contests the user participated in
yield from self._parse_competitions(path)
else:
raise RuntimeError(f"shouldn't happen: {path.name}")
def data() -> Iterator[Res[Competition]]:
return Parser(inputs=inputs()).parse()
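
A hedged sketch of consuming data(): errors are yielded as values (the Res pattern used throughout HPI), so a consumer typically filters them out rather than wrapping the call in try/except:

for res in data():
    if isinstance(res, Exception):
        continue  # e.g. a competition referencing a contest missing from the allcontests export
    print(res.when, res.old_rating, res.new_rating)
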


@ -1,92 +0,0 @@
#!/usr/bin/env python3
from my.config import codeforces as config # type: ignore[attr-defined]
from datetime import datetime, timezone
from typing import NamedTuple
import json
from typing import Dict, Iterator
from ..core import get_files, Res, unwrap
from ..core.compat import cached_property
from ..core.konsume import ignore, wrap
Cid = int
class Contest(NamedTuple):
cid: Cid
when: datetime
@classmethod
def make(cls, j) -> 'Contest':
return cls(
cid=j['id'],
when=datetime.fromtimestamp(j['startTimeSeconds'], tz=timezone.utc),
)
Cmap = Dict[Cid, Contest]
def get_contests() -> Cmap:
last = max(get_files(config.export_path, 'allcontests*.json'))
j = json.loads(last.read_text())
d = {}
for c in j['result']:
cc = Contest.make(c)
d[cc.cid] = cc
return d
class Competition(NamedTuple):
contest_id: Cid
contest: str
cmap: Cmap
@cached_property
def uid(self) -> Cid:
return self.contest_id
def __hash__(self):
return hash(self.contest_id)
@cached_property
def when(self) -> datetime:
return self.cmap[self.uid].when
@cached_property
def summary(self) -> str:
return f'participated in {self.contest}' # TODO
@classmethod
def make(cls, cmap, json) -> Iterator[Res['Competition']]:
# TODO try here??
contest_id = json['contestId'].zoom().value
contest = json['contestName'].zoom().value
yield cls(
contest_id=contest_id,
contest=contest,
cmap=cmap,
)
# TODO ytry???
ignore(json, 'rank', 'oldRating', 'newRating')
def iter_data() -> Iterator[Res[Competition]]:
cmap = get_contests()
last = max(get_files(config.export_path, 'codeforces*.json'))
with wrap(json.loads(last.read_text())) as j:
j['status'].ignore()
res = j['result'].zoom()
for c in list(res): # TODO maybe we want 'iter' method??
ignore(c, 'handle', 'ratingUpdateTimeSeconds')
yield from Competition.make(cmap=cmap, json=c)
c.consume()
# TODO maybe if they are all empty, no need to consume??
def get_data():
return list(sorted(iter_data(), key=Competition.when.fget))


@ -1,30 +1,32 @@
"""
Git commits data for repositories on your filesystem
"""
from __future__ import annotations
REQUIRES = [
'gitpython',
]
import shutil
from pathlib import Path
from datetime import datetime, timezone
from collections.abc import Iterator, Sequence
from dataclasses import dataclass, field
from typing import List, Optional, Iterator, Set, Sequence, cast
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional, cast
from my.core import PathIsh, LazyLogger, make_config
from my.core.cachew import cache_dir
from my.core.common import mcachew
from my.core import LazyLogger, PathIsh, make_config
from my.core.cachew import cache_dir, mcachew
from my.core.warnings import high
from my.config import commits as user_config # isort: skip
from my.config import commits as user_config
@dataclass
class commits_cfg(user_config):
roots: Sequence[PathIsh] = field(default_factory=list)
emails: Optional[Sequence[str]] = None
names: Optional[Sequence[str]] = None
emails: Sequence[str] | None = None
names: Sequence[str] | None = None
# experiment to make it lazy?
@ -38,9 +40,8 @@ def config() -> commits_cfg:
##########################
import git # type: ignore
from git.repo.fun import is_git_dir # type: ignore
import git
from git.repo.fun import is_git_dir
log = LazyLogger(__name__, level='info')
@ -94,7 +95,7 @@ def _git_root(git_dir: PathIsh) -> Path:
return gd # must be bare
def _repo_commits_aux(gr: git.Repo, rev: str, emitted: Set[str]) -> Iterator[Commit]:
def _repo_commits_aux(gr: git.Repo, rev: str, emitted: set[str]) -> Iterator[Commit]:
# without path might not handle pull heads properly
for c in gr.iter_commits(rev=rev):
if not by_me(c):
@ -121,7 +122,7 @@ def _repo_commits_aux(gr: git.Repo, rev: str, emitted: Set[str]) -> Iterator[Com
def repo_commits(repo: PathIsh):
gr = git.Repo(str(repo))
emitted: Set[str] = set()
emitted: set[str] = set()
for r in gr.references:
yield from _repo_commits_aux(gr=gr, rev=r.path, emitted=emitted)
@ -142,56 +143,56 @@ def canonical_name(repo: Path) -> str:
def _fd_path() -> str:
# todo move it to core
fd_path: Optional[str] = shutil.which("fdfind") or shutil.which("fd-find") or shutil.which("fd")
fd_path: str | None = shutil.which("fdfind") or shutil.which("fd-find") or shutil.which("fd")
if fd_path is None:
high("my.coding.commits requires 'fd' to be installed, See https://github.com/sharkdp/fd#installation")
assert fd_path is not None
return fd_path
def git_repos_in(roots: List[Path]) -> List[Path]:
def git_repos_in(roots: list[Path]) -> list[Path]:
from subprocess import check_output
outputs = check_output([
_fd_path(),
# '--follow', # right, not so sure about follow... make configurable?
'--hidden',
'--no-ignore', # otherwise doesn't go inside .git directory (from fd v9)
'--full-path',
'--type', 'f',
'/HEAD', # judging by is_git_dir, it should always be here..
*roots,
]).decode('utf8').splitlines()
candidates = set(Path(o).resolve().absolute().parent for o in outputs)
candidates = {Path(o).resolve().absolute().parent for o in outputs}
# exclude stuff within .git dirs (can happen for submodules?)
candidates = {c for c in candidates if '.git' not in c.parts[:-1]}
candidates = {c for c in candidates if is_git_dir(c)}
repos = list(sorted(map(_git_root, candidates)))
repos = sorted(map(_git_root, candidates))
return repos
def repos() -> List[Path]:
def repos() -> list[Path]:
return git_repos_in(list(map(Path, config().roots)))
# returns modification time for an index to use as hash function
def _repo_depends_on(_repo: Path) -> int:
for pp in {
for pp in [
".git/FETCH_HEAD",
".git/HEAD",
"FETCH_HEAD", # bare
"HEAD", # bare
}:
]:
ff = _repo / pp
if ff.exists():
return int(ff.stat().st_mtime)
else:
raise RuntimeError(f"Could not find a FETCH_HEAD/HEAD file in {_repo}")
def _commits(_repos: List[Path]) -> Iterator[Commit]:
def _commits(_repos: list[Path]) -> Iterator[Commit]:
for r in _repos:
yield from _cached_commits(r)


@ -1,9 +1,12 @@
import warnings
from typing import TYPE_CHECKING
warnings.warn('my.coding.github is deprecated! Please use my.github.all instead!')
from my.core import warnings
warnings.high('my.coding.github is deprecated! Please use my.github.all instead!')
# todo why aren't DeprecationWarning shown by default??
from ..github.all import events, get_events
if not TYPE_CHECKING:
from ..github.all import events, get_events # noqa: F401
# todo deprecate properly
iter_events = events


@ -1,84 +0,0 @@
#!/usr/bin/env python3
from my.config import topcoder as config # type: ignore[attr-defined]
from datetime import datetime
from typing import NamedTuple
import json
from typing import Dict, Iterator
from ..core import get_files, Res, unwrap, Json
from ..core.compat import cached_property
from ..core.error import Res, unwrap
from ..core.konsume import zoom, wrap, ignore
def _get_latest() -> Json:
pp = max(get_files(config.export_path))
return json.loads(pp.read_text())
class Competition(NamedTuple):
contest_id: str
contest: str
percentile: float
dates: str
@cached_property
def uid(self) -> str:
return self.contest_id
def __hash__(self):
return hash(self.contest_id)
@cached_property
def when(self) -> datetime:
return datetime.strptime(self.dates, '%Y-%m-%dT%H:%M:%S.%fZ')
@cached_property
def summary(self) -> str:
return f'participated in {self.contest}: {self.percentile:.0f}'
@classmethod
def make(cls, json) -> Iterator[Res['Competition']]:
ignore(json, 'rating', 'placement')
cid = json['challengeId'].zoom().value
cname = json['challengeName'].zoom().value
percentile = json['percentile'].zoom().value
dates = json['date'].zoom().value
yield cls(
contest_id=cid,
contest=cname,
percentile=percentile,
dates=dates,
)
def iter_data() -> Iterator[Res[Competition]]:
with wrap(_get_latest()) as j:
ignore(j, 'id', 'version')
res = j['result'].zoom()
ignore(res, 'success', 'status', 'metadata')
cont = res['content'].zoom()
ignore(cont, 'handle', 'handleLower', 'userId', 'createdAt', 'updatedAt', 'createdBy', 'updatedBy')
cont['DEVELOP'].ignore() # TODO handle it??
ds = cont['DATA_SCIENCE'].zoom()
mar, srm = zoom(ds, 'MARATHON_MATCH', 'SRM')
mar = mar['history'].zoom()
srm = srm['history'].zoom()
# TODO right, I guess I could rely on pylint for unused variables??
for c in mar + srm:
yield from Competition.make(json=c)
c.consume()
def get_data():
return list(sorted(iter_data(), key=Competition.when.fget))


@ -1,6 +1,6 @@
from .core.warnings import high
high("DEPRECATED! Please use my.core.common instead.")
from .core import __NOT_HPI_MODULE__
from .core.common import *


@ -9,17 +9,18 @@ This file is used for:
- mypy: this file provides some type annotations
- for loading the actual user config
'''
from __future__ import annotations
#### NOTE: you won't need this line VVVV in your personal config
from my.core import init
from my.core import init # noqa: F401 # isort: skip
###
from datetime import tzinfo
from pathlib import Path
from typing import List
from my.core import Paths, PathIsh
from my.core import PathIsh, Paths
class hypothesis:
@ -68,27 +69,45 @@ class pinboard:
export_dir: Paths = ''
class google:
class maps:
class android:
export_path: Paths = ''
takeout_path: Paths = ''
from typing import Sequence, Union, Tuple
from datetime import datetime, date
from collections.abc import Sequence
from datetime import date, datetime, timedelta
from typing import Union
DateIsh = Union[datetime, date, str]
LatLon = Tuple[float, float]
LatLon = tuple[float, float]
class location:
# todo ugh, need to think about it... mypy wants the type here to be general, otherwise it can't deduce
# and we can't import the types from the module itself, otherwise would be circular. common module?
home: Union[LatLon, Sequence[Tuple[DateIsh, LatLon]]] = (1.0, -1.0)
home: LatLon | Sequence[tuple[DateIsh, LatLon]] = (1.0, -1.0)
home_accuracy = 30_000.0
class via_ip:
accuracy: float
for_duration: timedelta
class gpslogger:
export_path: Paths = ''
accuracy: float
class google_takeout_semantic:
# a value between 0 and 100, 100 being the most confident
# set to 0 to include all locations
# https://locationhistoryformat.com/reference/semantic/#/$defs/placeVisit/properties/locationConfidence
require_confidence: float = 40
# default accuracy for semantic locations
accuracy: float = 100
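
For illustration, since DateIsh accepts plain strings, a multi-entry home history could be configured roughly like this (a hedged sketch; dates and coordinates are made up):

class location:
    # each entry is a (date, (lat, lon)) pair; a single (lat, lon) tuple also works
    home = [
        ('2010-01-01', (42.697708, 23.321867)),
        ('2017-06-15', (51.507351, -0.127758)),
    ]
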
from typing import Literal
from my.core.compat import Literal
class time:
class tz:
policy: Literal['keep', 'convert', 'throw']
@ -107,10 +126,9 @@ class arbtt:
logfiles: Paths
from typing import Optional
class commits:
emails: Optional[Sequence[str]]
names: Optional[Sequence[str]]
emails: Sequence[str] | None
names: Sequence[str] | None
roots: Sequence[PathIsh]
@ -136,6 +154,9 @@ class tinder:
class instagram:
class android:
export_path: Paths
username: str | None
full_name: str | None
class gdpr:
export_path: Paths
@ -152,7 +173,7 @@ class materialistic:
class fbmessenger:
class fbmessengerexport:
export_db: PathIsh
facebook_id: Optional[str]
facebook_id: str | None
class android:
export_path: Paths
@ -164,6 +185,8 @@ class twitter_archive:
class twitter:
class talon:
export_path: Paths
class android:
export_path: Paths
class twint:
@ -194,6 +217,7 @@ class simple:
class vk_messages_backup:
storage_path: Path
user_id: int
class kobo:
@ -227,7 +251,7 @@ class runnerup:
class emfit:
export_path: Path
timezone: tzinfo
excluded_sids: List[str]
excluded_sids: list[str]
class foursquare:
@ -247,5 +271,16 @@ class roamresearch:
username: str
class whatsapp:
class android:
export_path: Paths
my_user_id: str | None
class harmonic:
export_path: Paths
class monzo:
class monzoexport:
export_path: Paths


@ -1,39 +1,61 @@
# this file only keeps the most common & critical types/utility functions
from .common import get_files, PathIsh, Paths
from .common import Json
from .common import LazyLogger
from .common import warn_if_empty
from .common import stat, Stats
from .common import datetime_naive, datetime_aware
from .common import assert_never
from typing import TYPE_CHECKING
from .cfg import make_config
from .common import PathIsh, Paths, get_files
from .compat import assert_never
from .error import Res, notnone, unwrap
from .logging import (
make_logger,
)
from .stats import Stats, stat
from .types import (
Json,
datetime_aware,
datetime_naive,
)
from .util import __NOT_HPI_MODULE__
from .utils.itertools import warn_if_empty
from .error import Res, unwrap
LazyLogger = make_logger # TODO deprecate this in favor of make_logger
# just for brevity in modules
# todo not sure about these.. maybe best to rely on regular imports.. perhaps compare?
if not TYPE_CHECKING:
# we used to keep these here for brevity, but feels like it only adds confusion,
# e.g. suggest that we perhaps somehow modify builtin behaviour or whatever
# so best to prefer explicit behaviour
from dataclasses import dataclass
from pathlib import Path
__all__ = [
'get_files', 'PathIsh', 'Paths',
'Json',
'LazyLogger',
'warn_if_empty',
'stat', 'Stats',
'datetime_aware', 'datetime_naive',
'assert_never',
'make_config',
'__NOT_HPI_MODULE__',
'Res', 'unwrap',
'dataclass', 'Path',
'Json',
'LazyLogger', # legacy import
'Path',
'PathIsh',
'Paths',
'Res',
'Stats',
'assert_never', # TODO maybe deprecate from use in my.core? will be in stdlib soon
'dataclass',
'datetime_aware',
'datetime_naive',
'get_files',
'make_config',
'make_logger',
'notnone',
'stat',
'unwrap',
'warn_if_empty',
]
## experimental for now
# you could put _init_hook.py next to your private my/config
# that way you can configure logging/warnings/env variables on every HPI import
try:
import my._init_hook # type: ignore[import-not-found] # noqa: F401
except:
pass
##
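
For illustration, such a hook might look roughly like this (a hedged sketch; the specific settings are made up, though LOGGING_LEVEL_HPI is the variable the CLI manipulates below):

# my/_init_hook.py -- placed next to your private my/config
import os
import warnings

# default the overall HPI log level unless it's already set in the environment
os.environ.setdefault('LOGGING_LEVEL_HPI', 'warning')
# silence a hypothetical noisy warning on every import
warnings.filterwarnings('ignore', message='.*some deprecated module.*')
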


@ -1,23 +1,26 @@
from contextlib import ExitStack
from __future__ import annotations
import functools
import importlib
import inspect
from itertools import chain
import os
import shlex
import shutil
import sys
import tempfile
import traceback
from typing import Optional, Sequence, Iterable, List, Type, Any, Callable
from collections.abc import Iterable, Sequence
from contextlib import ExitStack
from itertools import chain
from pathlib import Path
from subprocess import check_call, run, PIPE, CompletedProcess, Popen
from subprocess import PIPE, CompletedProcess, Popen, check_call, run
from typing import Any, Callable
import click
@functools.lru_cache()
def mypy_cmd() -> Optional[Sequence[str]]:
@functools.lru_cache
def mypy_cmd() -> Sequence[str] | None:
try:
# preferably, use mypy from current python env
import mypy # noqa: F401 fine not to use it
@ -32,7 +35,7 @@ def mypy_cmd() -> Optional[Sequence[str]]:
return None
def run_mypy(cfg_path: Path) -> Optional[CompletedProcess]:
def run_mypy(cfg_path: Path) -> CompletedProcess | None:
# todo dunno maybe use the same mypy config in repository?
# I'd need to install mypy.ini then??
env = {**os.environ}
@ -43,7 +46,7 @@ def run_mypy(cfg_path: Path) -> Optional[CompletedProcess]:
cmd = mypy_cmd()
if cmd is None:
return None
mres = run([
mres = run([ # noqa: UP022,PLW1510
*cmd,
'--namespace-packages',
'--color-output', # not sure if works??
@ -63,22 +66,28 @@ def eprint(x: str) -> None:
# err=True prints to stderr
click.echo(x, err=True)
def indent(x: str) -> str:
# todo use textwrap.indent?
return ''.join(' ' + l for l in x.splitlines(keepends=True))
OK = ''
OFF = '🔲'
def info(x: str) -> None:
eprint(OK + ' ' + x)
def error(x: str) -> None:
eprint('' + x)
def warning(x: str) -> None:
eprint('' + x) # todo yellow?
def tb(e: Exception) -> None:
tb = ''.join(traceback.format_exception(Exception, e, e.__traceback__))
sys.stderr.write(indent(tb))
@ -86,6 +95,7 @@ def tb(e: Exception) -> None:
def config_create() -> None:
from .preinit import get_mycfg_dir
mycfg_dir = get_mycfg_dir()
created = False
@ -94,7 +104,8 @@ def config_create() -> None:
my_config = mycfg_dir / 'my' / 'config' / '__init__.py'
my_config.parent.mkdir(parents=True)
my_config.write_text('''
my_config.write_text(
'''
### HPI personal config
## see
# https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#setting-up-modules
@ -117,7 +128,8 @@ class example:
### you can insert your own configuration below
### but feel free to delete the stuff above if you don't need it
'''.lstrip())
'''.lstrip()
)
info(f'created empty config: {my_config}')
created = True
else:
@ -130,12 +142,13 @@ class example:
# todo return the config as a result?
def config_ok() -> bool:
errors: List[Exception] = []
errors: list[Exception] = []
# at this point 'my' should already be imported, so doesn't hurt to extract paths from it
import my
try:
paths: List[str] = list(my.__path__) # type: ignore[attr-defined]
paths: list[str] = list(my.__path__)
except Exception as e:
errors.append(e)
error('failed to determine module import path')
@ -143,21 +156,25 @@ def config_ok() -> bool:
else:
info(f'import order: {paths}')
# first try doing as much as possible without actually imporing my.config
# first try doing as much as possible without actually importing my.config
from .preinit import get_mycfg_dir
cfg_path = get_mycfg_dir()
# alternative is importing my.config and then getting cfg_path from its __file__/__path__
# not sure which is better tbh
## check we're not using stub config
import my.core
try:
core_pkg_path = str(Path(my.core.__path__[0]).parent) # type: ignore[attr-defined]
core_pkg_path = str(Path(my.core.__path__[0]).parent)
if str(cfg_path).startswith(core_pkg_path):
error(f'''
error(
f'''
Seems that the stub config is used ({cfg_path}). This is likely not going to work.
See https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#setting-up-modules for more information
'''.strip())
'''.strip()
)
errors.append(RuntimeError('bad config path'))
except Exception as e:
errors.append(e)
@ -171,16 +188,15 @@ See https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#setting-up-module
# use a temporary directory, useful because
# - compileall ignores -B, so always craps with .pyc files (annoying on RO filesystems)
# - compileall isn't following symlinks, just silently ignores them
# note: ugh, annoying that copytree requires a non-existing dir before 3.8.
# once we have min version 3.8, can use dirs_exist_ok=True param
tdir = Path(td) / 'cfg'
# this will resolve symlinks when copying
shutil.copytree(cfg_path, tdir)
# NOTE: compileall still returns code 0 if the path doesn't exist..
# but in our case hopefully it's not an issue
cmd = [sys.executable, '-m', 'compileall', '-q', str(tdir)]
try:
# this will resolve symlinks when copying
# should be under try/catch since might fail if some symlinks are missing
shutil.copytree(cfg_path, tdir, dirs_exist_ok=True)
check_call(cmd)
info('syntax check: ' + ' '.join(cmd))
except Exception as e:
@ -213,13 +229,15 @@ See https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#setting-up-module
if len(errors) > 0:
error(f'config check: {len(errors)} errors')
return False
else:
# note: shouldn't exit here, might run something else
info('config check: success!')
return True
from .util import HPIModule, modules
def _modules(*, all: bool = False) -> Iterable[HPIModule]:
skipped = []
for m in modules():
@ -231,7 +249,7 @@ def _modules(*, all: bool=False) -> Iterable[HPIModule]:
warning(f'Skipped {len(skipped)} modules: {skipped}. Pass --all if you want to see them.')
def modules_check(*, verbose: bool, list_all: bool, quick: bool, for_modules: List[str]) -> None:
def modules_check(*, verbose: bool, list_all: bool, quick: bool, for_modules: list[str]) -> None:
if len(for_modules) > 0:
# if you're checking specific modules, show errors
# hopefully makes sense?
@ -242,10 +260,9 @@ def modules_check(*, verbose: bool, list_all: bool, quick: bool, for_modules: Li
import contextlib
from .common import quick_stats
from .util import get_stats, HPIModule
from .stats import guess_stats
from .error import warn_my_config_import_error
from .stats import get_stats, quick_stats
from .util import HPIModule
mods: Iterable[HPIModule]
if len(for_modules) == 0:
@ -267,7 +284,7 @@ def modules_check(*, verbose: bool, list_all: bool, quick: bool, for_modules: Li
# todo more specific command?
error(f'{click.style("FAIL", fg="red")}: {m:<50} loading failed{vw}')
# check that this is an import error in particular, not because
# of a ModuleNotFoundError because some dependency wasnt installed
# of a ModuleNotFoundError because some dependency wasn't installed
if isinstance(e, (ImportError, AttributeError)):
warn_my_config_import_error(e)
if verbose:
@ -275,11 +292,8 @@ def modules_check(*, verbose: bool, list_all: bool, quick: bool, for_modules: Li
continue
info(f'{click.style("OK", fg="green")} : {m:<50}')
# first try explicitly defined stats function:
stats = get_stats(m)
if stats is None:
# then try guessing.. not sure if should log somehow?
stats = guess_stats(m, quick=quick)
# TODO add hpi 'stats'? instead of doctor? not sure
stats = get_stats(m, guess=True)
if stats is None:
eprint(" - no 'stats' function, can't check the data")
@ -290,6 +304,7 @@ def modules_check(*, verbose: bool, list_all: bool, quick: bool, for_modules: Li
try:
kwargs = {}
# todo hmm why wouldn't they be callable??
if callable(stats) and 'quick' in inspect.signature(stats).parameters:
kwargs['quick'] = quick
with quick_context:
@ -325,17 +340,20 @@ def tabulate_warnings() -> None:
Helper to avoid visual noise in hpi modules/doctor
'''
import warnings
orig = warnings.formatwarning
def override(*args, **kwargs) -> str:
res = orig(*args, **kwargs)
return ''.join(' ' + x for x in res.splitlines(keepends=True))
warnings.formatwarning = override
# TODO loggers as well?
def _requires(modules: Sequence[str]) -> Sequence[str]:
from .discovery_pure import module_by_name
mods = [module_by_name(module) for module in modules]
res = []
for mod in mods:
@ -362,7 +380,7 @@ def module_requires(*, module: Sequence[str]) -> None:
click.echo(x)
def module_install(*, user: bool, module: Sequence[str], parallel: bool=False) -> None:
def module_install(*, user: bool, module: Sequence[str], parallel: bool = False, break_system_packages: bool = False) -> None:
if isinstance(module, str):
# legacy behavior, used to take a since argument
module = [module]
@ -373,24 +391,30 @@ def module_install(*, user: bool, module: Sequence[str], parallel: bool=False) -
warning('requirements list is empty, no need to install anything')
return
use_uv = 'HPI_MODULE_INSTALL_USE_UV' in os.environ
pre_cmd = [
sys.executable, '-m', 'pip',
sys.executable, '-m', *(['uv'] if use_uv else []), 'pip',
'install',
*(['--user'] if user else []), # todo maybe instead, forward all the remaining args to pip?
*(['--break-system-packages'] if break_system_packages else []), # https://peps.python.org/pep-0668/
]
cmds = []
# disable parallel on windows, sometimes throws a
# '[WinError 32] The process cannot access the file because it is being used by another process'
if parallel and sys.platform not in ['win32', 'cygwin']:
# same on mac it seems? possible race conditions which are hard to debug?
# WARNING: Error parsing requirements for sqlalchemy: [Errno 2] No such file or directory: '/Users/runner/work/HPI/HPI/.tox/mypy-misc/lib/python3.7/site-packages/SQLAlchemy-2.0.4.dist-info/METADATA'
if parallel and sys.platform not in ['win32', 'cygwin', 'darwin']:
# todo not really sure if it's safe to install in parallel like this
# but definitely doesn't hurt to experiment for e.g. mypy pipelines
# pip has '--use-feature=fast-deps', but it doesn't really work
# I think it only helps for pypi artifacts (not git!),
# and only if they weren't cached
for r in requirements:
cmds.append(pre_cmd + [r])
cmds.append([*pre_cmd, r])
else:
if parallel:
warning('parallel install is not supported on this platform, installing sequentially...')
# install everything in one cmd
cmds.append(pre_cmd + list(requirements))
@ -433,11 +457,11 @@ def _ui_getchar_pick(choices: Sequence[str], prompt: str = 'Select from: ') -> i
return result_map[ch]
def _locate_functions_or_prompt(qualified_names: List[str], prompt: bool = True) -> Iterable[Callable[..., Any]]:
from .query import locate_qualified_function, QueryException
def _locate_functions_or_prompt(qualified_names: list[str], *, prompt: bool = True) -> Iterable[Callable[..., Any]]:
from .query import QueryException, locate_qualified_function
from .stats import is_data_provider
# if not connected to a terminal, cant prompt
# if not connected to a terminal, can't prompt
if not sys.stdout.isatty():
prompt = False
@ -451,9 +475,9 @@ def _locate_functions_or_prompt(qualified_names: List[str], prompt: bool = True)
# user to select a 'data provider' like function
try:
mod = importlib.import_module(qualname)
except Exception:
except Exception as ie:
eprint(f"During fallback, importing '{qualname}' as module failed")
raise qr_err
raise qr_err from ie
# find data providers in this module
data_providers = [f for _, f in inspect.getmembers(mod, inspect.isfunction) if is_data_provider(f)]
@ -467,7 +491,7 @@ def _locate_functions_or_prompt(qualified_names: List[str], prompt: bool = True)
else:
choices = [f.__name__ for f in data_providers]
if prompt is False:
# theres more than one possible data provider in this module,
# there's more than one possible data provider in this module,
# STDOUT is not a TTY, can't prompt
eprint("During fallback, more than one possible data provider, can't prompt since STDOUT is not a TTY")
eprint("Specify one of:")
@ -481,30 +505,42 @@ def _locate_functions_or_prompt(qualified_names: List[str], prompt: bool = True)
yield data_providers[chosen_index]
def _warn_exceptions(exc: Exception) -> None:
from my.core import make_logger
logger = make_logger('CLI', level='warning')
logger.exception(f'hpi query: {exc}')
# handle the 'hpi query' call
# can raise a QueryException, caught in the click command
def query_hpi_functions(
*,
output: str = 'json',
stream: bool = False,
qualified_names: List[str],
order_key: Optional[str],
order_by_value_type: Optional[Type],
qualified_names: list[str],
order_key: str | None,
order_by_value_type: type | None,
after: Any,
before: Any,
within: Any,
reverse: bool = False,
limit: Optional[int],
limit: int | None,
drop_unsorted: bool,
wrap_unsorted: bool,
warn_exceptions: bool,
raise_exceptions: bool,
drop_exceptions: bool,
) -> None:
from .query_range import select_range, RangeTuple
from .query_range import RangeTuple, select_range
# chain list of functions from user, in the order they wrote them on the CLI
input_src = chain(*(f() for f in _locate_functions_or_prompt(qualified_names)))
# NOTE: if passing just one function to this which returns a single namedtuple/dataclass,
# using both --order-key and --order-type will often be faster as it does not need to
# duplicate the iterator in memory, or try to find the --order-type type on each object before sorting
res = select_range(
input_src,
order_key=order_key,
@ -514,8 +550,11 @@ def query_hpi_functions(
limit=limit,
drop_unsorted=drop_unsorted,
wrap_unsorted=wrap_unsorted,
warn_exceptions=warn_exceptions,
warn_func=_warn_exceptions,
raise_exceptions=raise_exceptions,
drop_exceptions=drop_exceptions)
drop_exceptions=drop_exceptions,
)
if output == 'json':
from .serialize import dumps
@ -538,15 +577,35 @@ def query_hpi_functions(
pprint(item)
else:
pprint(list(res))
elif output == 'gpx':
from my.location.common import locations_to_gpx
# if user didn't specify to ignore exceptions, warn if locations_to_gpx
# cannot process the output of the command. This can be silenced by
# passing --drop-exceptions
if not raise_exceptions and not drop_exceptions:
warn_exceptions = True
# can ignore the mypy warning here, locations_to_gpx yields any errors
# if you didn't pass it something that matches the LocationProtocol
for exc in locations_to_gpx(res, sys.stdout): # type: ignore[arg-type]
if warn_exceptions:
_warn_exceptions(exc)
elif raise_exceptions:
raise exc
elif drop_exceptions:
pass
sys.stdout.flush()
else:
res = list(res) # type: ignore[assignment]
# output == 'repl'
eprint(f"\nInteract with the results by using the {click.style('res', fg='green')} variable\n")
try:
import IPython # type: ignore[import]
import IPython # type: ignore[import,unused-ignore]
except ModuleNotFoundError:
eprint("'repl' typically uses ipython, install it with 'python3 -m pip install ipython'. falling back to stdlib...")
import code
code.interact(local=locals())
else:
IPython.embed()
@ -554,16 +613,16 @@ def query_hpi_functions(
@click.group()
@click.option("--debug", is_flag=True, default=False, help="Show debug logs")
def main(debug: bool) -> None:
def main(*, debug: bool) -> None:
'''
Human Programming Interface
Tool for HPI
Work in progress, will be used for config management, troubleshooting & introspection
'''
# should overwrite anything else in HPI_LOGS
# should overwrite anything else in LOGGING_LEVEL_HPI
if debug:
os.environ["HPI_LOGS"] = "debug"
os.environ['LOGGING_LEVEL_HPI'] = 'debug'
# for potential future reference, if shared state needs to be added to groups
# https://click.palletsprojects.com/en/7.x/commands/#group-invocation-without-command
@ -572,7 +631,7 @@ def main(debug: bool) -> None:
# acts as a contextmanager of sorts - any subcommand will then run
# in something like /tmp/hpi_temp_dir
# to avoid importing relative modules by accident during development
# maybe can be removed later if theres more test coverage/confidence that nothing
# maybe can be removed later if there's more test coverage/confidence that nothing
# would happen?
# use a particular directory instead of a random one, since
@ -580,20 +639,19 @@ def main(debug: bool) -> None:
# to run things at the end (would need to use a callback or pass context)
# https://click.palletsprojects.com/en/7.x/commands/#nested-handling-and-contexts
tdir: str = os.path.join(tempfile.gettempdir(), 'hpi_temp_dir')
if not os.path.exists(tdir):
os.makedirs(tdir)
tdir = Path(tempfile.gettempdir()) / 'hpi_temp_dir'
tdir.mkdir(exist_ok=True)
os.chdir(tdir)
@functools.lru_cache(maxsize=1)
def _all_mod_names() -> List[str]:
def _all_mod_names() -> list[str]:
"""Should include all modules, in case user is trying to diagnose issues"""
# sort this, so that the order doesn't change while tabbing through
return sorted([m.name for m in modules()])
def _module_autocomplete(ctx: click.Context, args: Sequence[str], incomplete: str) -> List[str]:
def _module_autocomplete(ctx: click.Context, args: Sequence[str], incomplete: str) -> list[str]:
return [m for m in _all_mod_names() if m.startswith(incomplete)]
@ -603,7 +661,7 @@ def _module_autocomplete(ctx: click.Context, args: Sequence[str], incomplete: st
@click.option('-q', '--quick', is_flag=True, help='Only run partial checks (first 100 items)')
@click.option('-S', '--skip-config-check', 'skip_conf', is_flag=True, help='Skip configuration check')
@click.argument('MODULE', nargs=-1, required=False, shell_complete=_module_autocomplete)
def doctor_cmd(verbose: bool, list_all: bool, quick: bool, skip_conf: bool, module: Sequence[str]) -> None:
def doctor_cmd(*, verbose: bool, list_all: bool, quick: bool, skip_conf: bool, module: Sequence[str]) -> None:
'''
Run various checks
@ -637,7 +695,7 @@ def config_create_cmd() -> None:
@main.command(name='modules', short_help='list available modules')
@click.option('--all', 'list_all', is_flag=True, help='List all modules, including disabled')
def module_cmd(list_all: bool) -> None:
def module_cmd(*, list_all: bool) -> None:
'''List available modules'''
list_modules(list_all=list_all)
@ -650,7 +708,7 @@ def module_grp() -> None:
@module_grp.command(name='requires', short_help='print module reqs')
@click.argument('MODULES', shell_complete=_module_autocomplete, nargs=-1, required=True)
def module_requires_cmd(modules: Sequence[str]) -> None:
def module_requires_cmd(*, modules: Sequence[str]) -> None:
'''
Print MODULES requirements
@ -662,22 +720,26 @@ def module_requires_cmd(modules: Sequence[str]) -> None:
@module_grp.command(name='install', short_help='install module deps')
@click.option('--user', is_flag=True, help='same as pip --user')
@click.option('--parallel', is_flag=True, help='EXPERIMENTAL. Install dependencies in parallel.')
@click.option('-B',
'--break-system-packages',
is_flag=True,
help='Bypass PEP 668 and install dependencies into the system-wide python package directory.')
@click.argument('MODULES', shell_complete=_module_autocomplete, nargs=-1, required=True)
def module_install_cmd(user: bool, parallel: bool, modules: Sequence[str]) -> None:
def module_install_cmd(*, user: bool, parallel: bool, break_system_packages: bool, modules: Sequence[str]) -> None:
'''
Install dependencies for modules using pip
MODULES is one or more specific module names (e.g. my.reddit.rexport)
'''
# todo could add functions to check specific module etc..
module_install(user=user, module=modules, parallel=parallel)
module_install(user=user, module=modules, parallel=parallel, break_system_packages=break_system_packages)
@main.command(name='query', short_help='query the results of a HPI function')
@click.option('-o',
'--output',
default='json',
type=click.Choice(['json', 'pprint', 'repl']),
type=click.Choice(['json', 'pprint', 'repl', 'gpx']),
help='what to do with the result [default: json]')
@click.option('-s',
'--stream',
@ -730,6 +792,10 @@ def module_install_cmd(user: bool, parallel: bool, modules: Sequence[str]) -> No
default=False,
is_flag=True,
help="if the order of an item can't be determined while ordering, wrap them into an 'Unsortable' object")
@click.option('--warn-exceptions',
default=False,
is_flag=True,
help="if any errors are returned, print them as errors on STDERR")
@click.option('--raise-exceptions',
default=False,
is_flag=True,
@ -740,19 +806,21 @@ def module_install_cmd(user: bool, parallel: bool, modules: Sequence[str]) -> No
help='ignore any errors returned as objects from the functions')
@click.argument('FUNCTION_NAME', nargs=-1, required=True, shell_complete=_module_autocomplete)
def query_cmd(
*,
function_name: Sequence[str],
output: str,
stream: bool,
order_key: Optional[str],
order_type: Optional[str],
after: Optional[str],
before: Optional[str],
within: Optional[str],
recent: Optional[str],
order_key: str | None,
order_type: str | None,
after: str | None,
before: str | None,
within: str | None,
recent: str | None,
reverse: bool,
limit: Optional[int],
limit: int | None,
drop_unsorted: bool,
wrap_unsorted: bool,
warn_exceptions: bool,
raise_exceptions: bool,
drop_exceptions: bool,
) -> None:
@ -780,12 +848,12 @@ def query_cmd(
\b
Can also query within a range. To filter comments between 2016 and 2018:
hpi query --order-type datetime --after '2016-01-01 00:00:00' --before '2019-01-01 00:00:00' my.reddit.all.comments
hpi query --order-type datetime --after '2016-01-01' --before '2019-01-01' my.reddit.all.comments
'''
from datetime import datetime, date
from datetime import date, datetime
chosen_order_type: Optional[Type]
chosen_order_type: type | None
if order_type == "datetime":
chosen_order_type = datetime
elif order_type == "date":
@ -819,8 +887,10 @@ def query_cmd(
limit=limit,
drop_unsorted=drop_unsorted,
wrap_unsorted=wrap_unsorted,
warn_exceptions=warn_exceptions,
raise_exceptions=raise_exceptions,
drop_exceptions=drop_exceptions)
drop_exceptions=drop_exceptions,
)
except QueryException as qe:
eprint(str(qe))
sys.exit(1)
@ -835,6 +905,7 @@ def query_cmd(
def test_requires() -> None:
from click.testing import CliRunner
result = CliRunner().invoke(main, ['module', 'requires', 'my.github.ghexport', 'my.browser.export'])
assert result.exit_code == 0
assert "github.com/karlicoss/ghexport" in result.output

my/core/_cpu_pool.py Normal file (+35 lines)

@ -0,0 +1,35 @@
"""
EXPERIMENTAL! use with caution
Manages 'global' ProcessPoolExecutor which is 'managed' by HPI itself, and
can be passed down to DALs to speed up data processing.
The reason to have it managed by HPI is because we don't want DALs instantiate pools
themselves -- they can't cooperate and it would be hard/infeasible to control
how many cores we want to dedicate to the DAL.
Enabled by the env variable, specifying how many cores to dedicate
e.g. "HPI_CPU_POOL=4 hpi query ..."
"""
from __future__ import annotations
import os
from concurrent.futures import ProcessPoolExecutor
from typing import cast
_NOT_SET = cast(ProcessPoolExecutor, object())
_INSTANCE: ProcessPoolExecutor | None = _NOT_SET
def get_cpu_pool() -> ProcessPoolExecutor | None:
global _INSTANCE
if _INSTANCE is _NOT_SET:
use_cpu_pool = os.environ.get('HPI_CPU_POOL')
if use_cpu_pool is None or int(use_cpu_pool) == 0:
_INSTANCE = None
else:
# NOTE: this won't be cleaned up properly, but I guess it's fine?
# since it's basically a singleton for the whole process,
# and will be destroyed when python exits
_INSTANCE = ProcessPoolExecutor(max_workers=int(use_cpu_pool))
return _INSTANCE
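For illustration, a minimal sketch of how downstream code might consume the shared pool (the parsing helper and inputs are placeholders, not part of HPI):

from my.core._cpu_pool import get_cpu_pool

def _parse(raw: str) -> str:
    # stand-in for some CPU-heavy per-item work
    return raw.upper()

def parse_all(raw_items: list[str]) -> list[str]:
    pool = get_cpu_pool()
    if pool is None:
        # HPI_CPU_POOL unset (or set to 0): process sequentially
        return [_parse(x) for x in raw_items]
    # otherwise fan the work out over the shared executor
    return list(pool.map(_parse, raw_items))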

View file

@ -0,0 +1,12 @@
from ..common import PathIsh
from ..sqlite import sqlite_connect_immutable
def connect_readonly(db: PathIsh):
import dataset # type: ignore
# see https://github.com/pudo/dataset/issues/136#issuecomment-128693122
# todo not sure if mode=ro has any benefit, but it doesn't work on read-only filesystems
# maybe it should autodetect readonly filesystems and apply this? not sure
creator = lambda: sqlite_connect_immutable(db)
return dataset.connect('sqlite:///', engine_kwargs={'creator': creator})
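For illustration, a hedged usage sketch (the database path and table name are placeholders; find(order_by=...) follows the dataset API referenced elsewhere in this changeset):

db = connect_readonly('/path/to/export.sqlite')
for row in db['messages'].find(order_by='timestamp'):
    print(row)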

View file

@ -0,0 +1,261 @@
"""
Various helpers for compression
"""
# fmt: off
from __future__ import annotations
import io
import pathlib
from collections.abc import Iterator, Sequence
from datetime import datetime
from functools import total_ordering
from pathlib import Path
from typing import IO, Union
PathIsh = Union[Path, str]
class Ext:
xz = '.xz'
zip = '.zip'
lz4 = '.lz4'
zstd = '.zstd'
zst = '.zst'
targz = '.tar.gz'
def is_compressed(p: Path) -> bool:
# todo kinda lame way for now.. use mime ideally?
# should cooperate with kompress.kopen?
return any(p.name.endswith(ext) for ext in [Ext.xz, Ext.zip, Ext.lz4, Ext.zstd, Ext.zst, Ext.targz])
def _zstd_open(path: Path, *args, **kwargs) -> IO:
import zstandard as zstd # type: ignore
fh = path.open('rb')
dctx = zstd.ZstdDecompressor()
reader = dctx.stream_reader(fh)
mode = kwargs.get('mode', 'rt')
if mode == 'rb':
return reader
else:
# must be text mode
kwargs.pop('mode') # TextIOWrapper doesn't like it
return io.TextIOWrapper(reader, **kwargs) # meh
# TODO use the 'dependent type' trick for return type?
def kopen(path: PathIsh, *args, mode: str='rt', **kwargs) -> IO:
# just in case, but I think this shouldn't be necessary anymore
# since when we call .read_text, encoding is passed already
if mode in {'r', 'rt'}:
encoding = kwargs.get('encoding', 'utf8')
else:
encoding = None
kwargs['encoding'] = encoding
pp = Path(path)
name = pp.name
if name.endswith(Ext.xz):
import lzma
# ugh. for lzma, 'r' means 'rb'
# https://github.com/python/cpython/blob/d01cf5072be5511595b6d0c35ace6c1b07716f8d/Lib/lzma.py#L97
# whereas for regular open, 'r' means 'rt'
# https://docs.python.org/3/library/functions.html#open
if mode == 'r':
mode = 'rt'
kwargs['mode'] = mode
return lzma.open(pp, *args, **kwargs)
elif name.endswith(Ext.zip):
# eh. this behaviour is a bit dodgy...
from zipfile import ZipFile
zfile = ZipFile(pp)
[subpath] = args # meh?
## oh god... https://stackoverflow.com/a/5639960/706389
ifile = zfile.open(subpath, mode='r')
ifile.readable = lambda: True # type: ignore
ifile.writable = lambda: False # type: ignore
ifile.seekable = lambda: False # type: ignore
ifile.read1 = ifile.read # type: ignore
# TODO pass all kwargs here??
# todo 'expected "BinaryIO"'??
return io.TextIOWrapper(ifile, encoding=encoding)
elif name.endswith(Ext.lz4):
import lz4.frame # type: ignore
return lz4.frame.open(str(pp), mode, *args, **kwargs)
elif name.endswith(Ext.zstd) or name.endswith(Ext.zst): # noqa: PIE810
kwargs['mode'] = mode
return _zstd_open(pp, *args, **kwargs)
elif name.endswith(Ext.targz):
import tarfile
# FIXME pass mode?
tf = tarfile.open(pp)
# TODO pass encoding?
x = tf.extractfile(*args); assert x is not None
return x
else:
return pp.open(mode, *args, **kwargs)
import os
import typing
if typing.TYPE_CHECKING:
# otherwise mypy can't figure out that BasePath is a type alias..
BasePath = pathlib.Path
else:
BasePath = pathlib.WindowsPath if os.name == 'nt' else pathlib.PosixPath
class CPath(BasePath):
"""
Hacky way to support compressed files.
If you can think of a better way to do this, please let me know! https://github.com/karlicoss/HPI/issues/20
Ugh. So, can't override Path because of some _flavour thing.
Path only has _accessor and _closed slots, so can't directly set .open method
_accessor.open has to return file descriptor, doesn't work for compressed stuff.
"""
def open(self, *args, **kwargs): # noqa: ARG002
kopen_kwargs = {}
mode = kwargs.get('mode')
if mode is not None:
kopen_kwargs['mode'] = mode
encoding = kwargs.get('encoding')
if encoding is not None:
kopen_kwargs['encoding'] = encoding
# TODO assert read only?
return kopen(str(self), **kopen_kwargs)
open = kopen # TODO deprecate
# meh
# TODO ideally switch to ZipPath or smth similar?
# nothing else supports subpath properly anyway
def kexists(path: PathIsh, subpath: str) -> bool:
try:
kopen(path, subpath)
except Exception:
return False
else:
return True
import zipfile
# meh... zipfile.Path is not available on 3.7
zipfile_Path = zipfile.Path
@total_ordering
class ZipPath(zipfile_Path):
# NOTE: is_dir/is_file might not behave as expected, the base class checks it only based on the slash in path
# seems that root/at are not exposed in the docs, so might be an implementation detail
root: zipfile.ZipFile # type: ignore[assignment]
at: str
@property
def filepath(self) -> Path:
res = self.root.filename
assert res is not None # make mypy happy
return Path(res)
@property
def subpath(self) -> Path:
return Path(self.at)
def absolute(self) -> ZipPath:
return ZipPath(self.filepath.absolute(), self.at)
def expanduser(self) -> ZipPath:
return ZipPath(self.filepath.expanduser(), self.at)
def exists(self) -> bool:
if self.at == '':
# special case, the base class returns False in this case for some reason
return self.filepath.exists()
return super().exists() or self._as_dir().exists()
def _as_dir(self) -> zipfile_Path:
# note: seems that zip always uses forward slash, regardless OS?
return zipfile_Path(self.root, self.at + '/')
def rglob(self, glob: str) -> Iterator[ZipPath]:
# note: not 100% sure about the correctness, but seems fine?
# Path.match() matches from the right, so need to
rpaths = [p for p in self.root.namelist() if p.startswith(self.at)]
rpaths = [p for p in rpaths if Path(p).match(glob)]
return (ZipPath(self.root, p) for p in rpaths)
def relative_to(self, other: ZipPath) -> Path: # type: ignore[override, unused-ignore]
assert self.filepath == other.filepath, (self.filepath, other.filepath)
return self.subpath.relative_to(other.subpath)
@property
def parts(self) -> Sequence[str]:
# messy, but might be ok..
return self.filepath.parts + self.subpath.parts
def __truediv__(self, key) -> ZipPath:
# need to implement it so the return type is not zipfile.Path
tmp = zipfile_Path(self.root) / self.at / key
return ZipPath(self.root, tmp.at)
def iterdir(self) -> Iterator[ZipPath]:
for s in self._as_dir().iterdir():
yield ZipPath(s.root, s.at)
@property
def stem(self) -> str:
return self.subpath.stem
@property # type: ignore[misc]
def __class__(self):
return Path
def __eq__(self, other) -> bool:
# hmm, super class doesn't seem to treat as equals unless they are the same object
if not isinstance(other, ZipPath):
return False
return (self.filepath, self.subpath) == (other.filepath, other.subpath)
def __lt__(self, other) -> bool:
if not isinstance(other, ZipPath):
return False
return (self.filepath, self.subpath) < (other.filepath, other.subpath)
def __hash__(self) -> int:
return hash((self.filepath, self.subpath))
def stat(self) -> os.stat_result:
# NOTE: zip datetimes have no notion of time zone, usually they just keep local time?
# see https://en.wikipedia.org/wiki/ZIP_(file_format)#Structure
dt = datetime(*self.root.getinfo(self.at).date_time)
ts = int(dt.timestamp())
params = dict( # noqa: C408
st_mode=0,
st_ino=0,
st_dev=0,
st_nlink=1,
st_uid=1000,
st_gid=1000,
st_size=0, # todo compute it properly?
st_atime=ts,
st_mtime=ts,
st_ctime=ts,
)
return os.stat_result(tuple(params.values()))
@property
def suffix(self) -> str:
return Path(self.parts[-1]).suffix
# fmt: on
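For illustration, a small sketch of the helpers above (paths are placeholders, and the exact import location isn't shown in this diff, so kopen/CPath/ZipPath are assumed to be in scope):

# kopen picks a decompressor based on the file extension
with kopen('/path/to/comments.json.xz') as fo:
    text = fo.read()

# CPath behaves like a Path whose open()/read_text() route through kopen
content = CPath('/path/to/log.txt.zst').read_text()

# ZipPath addresses a member inside a zip archive
zp = ZipPath('/path/to/takeout.zip', 'archive/tweets.js')
print(zp.exists(), zp.suffix)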

View file

@ -1,8 +1,30 @@
from .common import assert_subpackage; assert_subpackage(__name__)
from __future__ import annotations
from .internal import assert_subpackage
assert_subpackage(__name__)
import logging
import sys
from collections.abc import Iterator
from contextlib import contextmanager
from pathlib import Path
from typing import Optional
from typing import (
TYPE_CHECKING,
Any,
Callable,
TypeVar,
Union,
cast,
overload,
)
import appdirs # type: ignore[import-untyped]
from . import warnings
PathIsh = Union[str, Path] # avoid circular import from .common
def disable_cachew() -> None:
try:
@ -12,10 +34,10 @@ def disable_cachew() -> None:
return
from cachew import settings
settings.ENABLE = False
from typing import Iterator
@contextmanager
def disabled_cachew() -> Iterator[None]:
try:
@ -25,23 +47,26 @@ def disabled_cachew() -> Iterator[None]:
yield
return
from cachew.extra import disabled_cachew
with disabled_cachew():
yield
def _appdirs_cache_dir() -> Path:
import appdirs # type: ignore
cd = Path(appdirs.user_cache_dir('my'))
cd.mkdir(exist_ok=True, parents=True)
return cd
from . import PathIsh
def cache_dir(suffix: Optional[PathIsh] = None) -> Path:
_CACHE_DIR_NONE_HACK = Path('/tmp/hpi/cachew_none_hack')
def cache_dir(suffix: PathIsh | None = None) -> Path:
from . import core_config as CC
cdir_ = CC.config.get_cache_dir()
sp: Optional[Path] = None
sp: Path | None = None
if suffix is not None:
sp = Path(suffix)
# guess if you do need absolute, better path it directly instead of as suffix?
@ -55,9 +80,84 @@ def cache_dir(suffix: Optional[PathIsh] = None) -> Path:
# this logic is tested via test_cachew_dir_none
if cdir_ is None:
from .common import _CACHE_DIR_NONE_HACK
cdir = _CACHE_DIR_NONE_HACK
else:
cdir = cdir_
return cdir if sp is None else cdir / sp
"""See core.cachew.cache_dir for the explanation"""
_cache_path_dflt = cast(str, object())
# TODO I don't really like 'mcachew', just 'cache' would be better... maybe?
# todo ugh. I think it needs @doublewrap, otherwise @mcachew without args doesn't work
# but it's a bit problematic.. doublewrap works by detecting if the first arg is callable
# but here cache_path can also be a callable (for lazy/dynamic path)... so unclear how to detect this
def _mcachew_impl(cache_path=_cache_path_dflt, **kwargs):
"""
Stands for 'Maybe cachew'.
Defensive wrapper around @cachew to make it an optional dependency.
"""
if cache_path is _cache_path_dflt:
# wasn't specified... so we need to use cache_dir
cache_path = cache_dir()
if isinstance(cache_path, (str, Path)):
try:
# check that it starts with 'hack' path
Path(cache_path).relative_to(_CACHE_DIR_NONE_HACK)
except: # noqa: E722 bare except
pass # no action needed, doesn't start with 'hack' string
else:
# todo show warning? tbh unclear how to detect when the user stopped using the 'old' way and switched to a suffix instead?
# if it does, it means the user wanted to disable the cache
cache_path = None
try:
import cachew
except ModuleNotFoundError:
warnings.high('cachew library not found. You might want to install it to speed things up. See https://github.com/karlicoss/cachew')
return lambda orig_func: orig_func
else:
kwargs['cache_path'] = cache_path
return cachew.cachew(**kwargs)
if TYPE_CHECKING:
R = TypeVar('R')
if sys.version_info[:2] >= (3, 10):
from typing import ParamSpec
else:
from typing_extensions import ParamSpec
P = ParamSpec('P')
CC = Callable[P, R] # need to give it a name, if inlined into bound=, mypy runs into a bug
PathProvider = Union[PathIsh, Callable[P, PathIsh]]
# NOTE: in cachew, HashFunction type returns str
# however in practice, cachew always calls str for its result
# so perhaps better to switch it to Any in cachew as well
HashFunction = Callable[P, Any]
F = TypeVar('F', bound=Callable)
# we need two versions due to @doublewrap
# this is when we just annotate as @cachew without any args
@overload # type: ignore[no-overload-impl]
def mcachew(fun: F) -> F: ...
@overload
def mcachew(
cache_path: PathProvider | None = ...,
*,
force_file: bool = ...,
cls: type | None = ...,
depends_on: HashFunction = ...,
logger: logging.Logger | None = ...,
chunk_by: int = ...,
synthetic_key: str | None = ...,
) -> Callable[[F], F]: ...
else:
mcachew = _mcachew_impl
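For illustration, a hedged sketch of decorating a provider with mcachew (the record type and data are placeholders; the import path assumes this file is my.core.cachew):

from collections.abc import Iterator
from typing import NamedTuple

from my.core.cachew import cache_dir, mcachew

class Measurement(NamedTuple):  # placeholder record type
    value: int

# cached under the cachew cache dir when the cachew library is installed; a plain passthrough otherwise
@mcachew(cache_path=cache_dir() / 'measurements')
def measurements() -> Iterator[Measurement]:
    for i in range(3):
        yield Measurement(value=i)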

View file

@ -1,32 +1,42 @@
from typing import TypeVar, Type, Callable, Dict, Any
from __future__ import annotations
Attrs = Dict[str, Any]
import importlib
import re
import sys
from collections.abc import Iterator
from contextlib import ExitStack, contextmanager
from typing import Any, Callable, TypeVar
Attrs = dict[str, Any]
C = TypeVar('C')
# todo not sure about it, could be overthinking...
# but short enough to change later
# TODO document why it's necessary?
def make_config(cls: Type[C], migration: Callable[[Attrs], Attrs]=lambda x: x) -> C:
def make_config(cls: type[C], migration: Callable[[Attrs], Attrs] = lambda x: x) -> C:
user_config = cls.__base__
old_props = {
# NOTE: deliberately use getattr to 'force' class properties here
k: getattr(user_config, k) for k in vars(user_config)
k: getattr(user_config, k)
for k in vars(user_config)
}
new_props = migration(old_props)
from dataclasses import fields
params = {
k: v
for k, v in new_props.items()
if k in {f.name for f in fields(cls)}
if k in {f.name for f in fields(cls)} # type: ignore[arg-type] # see https://github.com/python/typing_extensions/issues/115
}
# todo maybe return type here?
return cls(**params) # type: ignore[call-arg]
return cls(**params)
F = TypeVar('F')
from contextlib import contextmanager
from typing import Iterator
@contextmanager
def _override_config(config: F) -> Iterator[F]:
'''
@ -44,26 +54,30 @@ def _override_config(config: F) -> Iterator[F]:
delattr(config, k)
import importlib
import sys
from typing import Optional, Set
ModuleRegex = str
@contextmanager
def _reload_modules(modules: ModuleRegex) -> Iterator[None]:
def loaded_modules() -> Set[str]:
return {name for name in sys.modules if re.fullmatch(modules, name)}
# need to use list here, otherwise reordering with set might mess things up
def loaded_modules() -> list[str]:
return [name for name in sys.modules if re.fullmatch(modules, name)]
modules_before = loaded_modules()
for m in modules_before:
# uhh... seems that reversed might make more sense -- not 100% sure why, but this works for tests/reddit.py
for m in reversed(modules_before):
# ugh... seems that reload works whereas pop doesn't work in some cases (e.g. on tests/reddit.py)
# sys.modules.pop(m, None)
importlib.reload(sys.modules[m])
try:
yield
finally:
modules_after = loaded_modules()
modules_before_set = set(modules_before)
for m in modules_after:
if m in modules_before:
if m in modules_before_set:
# was previously loaded, so need to reload to pick up old config
importlib.reload(sys.modules[m])
else:
@ -72,16 +86,15 @@ def _reload_modules(modules: ModuleRegex) -> Iterator[None]:
sys.modules.pop(m, None)
from contextlib import ExitStack
import re
@contextmanager
def tmp_config(*, modules: Optional[ModuleRegex]=None, config=None):
def tmp_config(*, modules: ModuleRegex | None = None, config=None):
if modules is None:
assert config is None
if modules is not None:
assert config is not None
import my.config
with ExitStack() as module_reload_stack, _override_config(my.config) as new_config:
if config is not None:
overrides = {k: v for k, v in vars(config).items() if not k.startswith('__')}
@ -96,6 +109,7 @@ def tmp_config(*, modules: Optional[ModuleRegex]=None, config=None):
def test_tmp_config() -> None:
class extra:
data_path = '/path/to/data'
with tmp_config() as c:
assert c.google != 'whatever'
assert not hasattr(c, 'extra')
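For illustration, a hedged sketch of overriding config for one module namespace for the duration of a block (the namespace and attribute are placeholders):

class override:
    class hypothetical:
        export_path = '/tmp/hypothetical/export'

# attributes of 'override' are copied onto my.config, and already-loaded modules
# matching the regex are reloaded so they pick up the new values
with tmp_config(modules=r'my\.hypothetical.*', config=override):
    ...  # import and use my.hypothetical here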

View file

@ -1,180 +1,43 @@
from __future__ import annotations
import os
from collections.abc import Iterable, Sequence
from glob import glob as do_glob
from pathlib import Path
from datetime import datetime
import functools
from contextlib import contextmanager
import types
from typing import Union, Callable, Dict, Iterable, TypeVar, Sequence, List, Optional, Any, cast, Tuple, TYPE_CHECKING, NoReturn
import warnings
from . import warnings as core_warnings
from typing import (
TYPE_CHECKING,
Callable,
Generic,
TypeVar,
Union,
)
from . import compat, warnings
# some helper functions
# TODO start deprecating this? soon we'd be able to use Path | str syntax which is shorter and more explicit
PathIsh = Union[Path, str]
# TODO only used in tests? not sure if useful at all.
def import_file(p: PathIsh, name: Optional[str] = None) -> types.ModuleType:
p = Path(p)
if name is None:
name = p.stem
import importlib.util
spec = importlib.util.spec_from_file_location(name, p)
assert spec is not None, f"Fatal error; Could not create module spec from {name} {p}"
foo = importlib.util.module_from_spec(spec)
loader = spec.loader; assert loader is not None
loader.exec_module(foo) # type: ignore[attr-defined]
return foo
def import_from(path: PathIsh, name: str) -> types.ModuleType:
path = str(path)
import sys
try:
sys.path.append(path)
import importlib
return importlib.import_module(name)
finally:
sys.path.remove(path)
def import_dir(path: PathIsh, extra: str='') -> types.ModuleType:
p = Path(path)
if p.parts[0] == '~':
p = p.expanduser() # TODO eh. not sure about this..
return import_from(p.parent, p.name + extra)
T = TypeVar('T')
K = TypeVar('K')
V = TypeVar('V')
# TODO deprecate? more_itertools.one should be used
def the(l: Iterable[T]) -> T:
it = iter(l)
try:
first = next(it)
except StopIteration:
raise RuntimeError('Empty iterator?')
assert all(e == first for e in it)
return first
# TODO more_itertools.bucket?
def group_by_key(l: Iterable[T], key: Callable[[T], K]) -> Dict[K, List[T]]:
res: Dict[K, List[T]] = {}
for i in l:
kk = key(i)
lst = res.get(kk, [])
lst.append(i)
res[kk] = lst
return res
def _identity(v: T) -> V: # type: ignore[type-var]
return cast(V, v)
# ugh. nothing in more_itertools?
def ensure_unique(
it: Iterable[T],
*,
key: Callable[[T], K],
value: Callable[[T], V]=_identity,
key2value: Optional[Dict[K, V]]=None
) -> Iterable[T]:
if key2value is None:
key2value = {}
for i in it:
k = key(i)
v = value(i)
pv = key2value.get(k, None) # type: ignore
if pv is not None:
raise RuntimeError(f"Duplicate key: {k}. Previous value: {pv}, new value: {v}")
key2value[k] = v
yield i
def test_ensure_unique() -> None:
import pytest # type: ignore
assert list(ensure_unique([1, 2, 3], key=lambda i: i)) == [1, 2, 3]
dups = [1, 2, 1, 4]
# this works because it's lazy
it = ensure_unique(dups, key=lambda i: i)
# but forcing throws
with pytest.raises(RuntimeError, match='Duplicate key'):
list(it)
# hacky way to force distinct objects?
list(ensure_unique(dups, key=lambda i: object()))
def make_dict(
it: Iterable[T],
*,
key: Callable[[T], K],
value: Callable[[T], V]=_identity
) -> Dict[K, V]:
res: Dict[K, V] = {}
uniques = ensure_unique(it, key=key, value=value, key2value=res)
for _ in uniques:
pass # force the iterator
return res
def test_make_dict() -> None:
it = range(5)
d = make_dict(it, key=lambda i: i, value=lambda i: i % 2)
assert d == {0: 0, 1: 1, 2: 0, 3: 1, 4: 0}
# check type inference
d2: Dict[str, int ] = make_dict(it, key=lambda i: str(i))
d3: Dict[str, bool] = make_dict(it, key=lambda i: str(i), value=lambda i: i % 2 == 0)
# https://stackoverflow.com/a/12377059/706389
def listify(fn=None, wrapper=list):
"""
Wraps a function's return value in wrapper (e.g. list)
Useful when an algorithm can be expressed more cleanly as a generator
"""
def listify_return(fn):
@functools.wraps(fn)
def listify_helper(*args, **kw):
return wrapper(fn(*args, **kw))
return listify_helper
if fn is None:
return listify_return
return listify_return(fn)
# todo use in bluemaestro
# def dictify(fn=None, key=None, value=None):
# def md(it):
# return make_dict(it, key=key, value=value)
# return listify(fn=fn, wrapper=md)
from .logging import setup_logger, LazyLogger
Paths = Union[Sequence[PathIsh], PathIsh]
DEFAULT_GLOB = '*'
def get_files(
pp: Paths,
glob: str = DEFAULT_GLOB,
*,
sort: bool = True,
guess_compression: bool = True,
) -> Tuple[Path, ...]:
) -> tuple[Path, ...]:
"""
Helper function to avoid boilerplate.
Tuple as return type is a bit friendlier for hashing/caching, so hopefully makes sense
"""
# TODO FIXME mm, some wrapper to assert iterator isn't empty?
sources: List[Path]
sources: list[Path]
if isinstance(pp, Path):
sources = [pp]
elif isinstance(pp, str):
@ -183,14 +46,15 @@ def get_files(
return () # early return to prevent warnings etc
sources = [Path(pp)]
else:
sources = [Path(p) for p in pp]
sources = [p if isinstance(p, Path) else Path(p) for p in pp]
def caller() -> str:
import traceback
# TODO ugh. very flaky... -3 because [<this function>, get_files(), <actual caller>]
return traceback.extract_stack()[-3].filename
paths: List[Path] = []
paths: list[Path] = []
for src in sources:
if src.parts[0] == '~':
src = src.expanduser()
@ -198,140 +62,81 @@ def get_files(
gs = str(src)
if '*' in gs:
if glob != DEFAULT_GLOB:
warnings.warn(f"{caller()}: treating {gs} as glob path. Explicit glob={glob} argument is ignored!")
paths.extend(map(Path, do_glob(gs)))
elif src.is_dir():
warnings.medium(f"{caller()}: treating {gs} as glob path. Explicit glob={glob} argument is ignored!")
paths.extend(map(Path, do_glob(gs))) # noqa: PTH207
elif os.path.isdir(str(src)): # noqa: PTH112
# NOTE: we're using os.path here on purpose instead of src.is_dir
# the reason is that is_dir for archives might return True, and then
# this clause would try globbing inside the archives
# this is generally undesirable (since modules handle archives themselves)
# todo not sure if should be recursive?
# note: glob='**/*.ext' works without any changes.. so perhaps it's ok as it is
gp: Iterable[Path] = src.glob(glob)
paths.extend(gp)
else:
if not src.is_file():
# todo not sure, might be race condition?
raise RuntimeError(f"Expected '{src}' to exist")
assert src.exists(), src
# todo assert matches glob??
paths.append(src)
if sort:
paths = list(sorted(paths))
paths = sorted(paths)
if len(paths) == 0:
# todo make it conditionally defensive based on some global settings
core_warnings.high(f'''
warnings.high(f'''
{caller()}: no paths were matched against {pp}. This might result in missing data. Likely, the directory you passed is empty.
'''.strip())
# traceback is useful to figure out what config caused it?
import traceback
traceback.print_stack()
if guess_compression:
from .kompress import CPath, is_compressed
paths = [CPath(p) if is_compressed(p) else p for p in paths]
from .kompress import CPath, ZipPath, is_compressed
# NOTE: wrap is just for backwards compat with vendorized kompress
# with the kompress library, only the is_compressed check and CPath should be enough
def wrap(p: Path) -> Path:
if isinstance(p, ZipPath):
return p
if p.suffix == '.zip':
return ZipPath(p) # type: ignore[return-value]
if is_compressed(p):
return CPath(p)
return p
paths = [wrap(p) for p in paths]
return tuple(paths)
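For illustration, a couple of hedged get_files calls (paths are placeholders):

from my.core.common import get_files

# a directory: expand the glob inside it, sort, and wrap compressed files into CPath/ZipPath
exports = get_files('~/data/takeouts/', glob='*.zip')

# a path that already contains a wildcard is treated as a glob itself
logs = get_files('~/data/exports/export-*.json.xz')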
# TODO annotate it, perhaps use 'dependent' type (for @doublewrap stuff)
if TYPE_CHECKING:
from typing import Callable, TypeVar
from typing_extensions import Protocol
# TODO reuse types from cachew? although not sure if we want hard dependency on it in typecheck time..
# I guess, later just define pass through once this is fixed: https://github.com/python/typing/issues/270
# ok, that's actually a super nice 'pattern'
F = TypeVar('F')
class McachewType(Protocol):
def __call__(
self,
cache_path: Any=None,
*,
hashf: Any=None, # todo deprecate
depends_on: Any=None,
force_file: bool=False,
chunk_by: int=0,
logger: Any=None,
) -> Callable[[F], F]:
...
mcachew: McachewType
_CACHE_DIR_NONE_HACK = Path('/tmp/hpi/cachew_none_hack')
"""See core.cachew.cache_dir for the explanation"""
_cache_path_dflt = cast(str, object())
# TODO I don't really like 'mcachew', just 'cache' would be better... maybe?
# todo ugh. I think it needs @doublewrap, otherwise @mcachew without args doesn't work
# but it's a bit problematic.. doublewrap works by defecting if the first arg is callable
# but here cache_path can also be a callable (for lazy/dynamic path)... so unclear how to detect this
def mcachew(cache_path=_cache_path_dflt, **kwargs): # type: ignore[no-redef]
"""
Stands for 'Maybe cachew'.
Defensive wrapper around @cachew to make it an optional dependency.
"""
if cache_path is _cache_path_dflt:
# wasn't specified... so we need to use cache_dir
from .cachew import cache_dir
cache_path = cache_dir()
if isinstance(cache_path, (str, Path)):
try:
# check that it starts with 'hack' path
Path(cache_path).relative_to(_CACHE_DIR_NONE_HACK)
except: # noqa: E722 bare except
pass # no action needed, doesn't start with 'hack' string
else:
# todo show warning? tbh unclear how to detect when the user stopped using the 'old' way and switched to a suffix instead?
# if it does, it means the user wanted to disable the cache
cache_path = None
try:
import cachew
except ModuleNotFoundError:
warnings.warn('cachew library not found. You might want to install it to speed things up. See https://github.com/karlicoss/cachew')
return lambda orig_func: orig_func
else:
kwargs['cache_path'] = cache_path
return cachew.cachew(**kwargs)
@functools.lru_cache(1)
def _magic():
import magic # type: ignore
return magic.Magic(mime=True)
# TODO could reuse in pdf module?
import mimetypes # todo do I need init()?
# todo wtf? fastermime thinks its mime is application/json even if the extension is xz??
# whereas magic detects correctly: application/x-zstd and application/x-xz
def fastermime(path: PathIsh) -> str:
paths = str(path)
# mimetypes is faster
(mime, _) = mimetypes.guess_type(paths)
if mime is not None:
return mime
# magic is slower but returns more stuff
# TODO Result type?; it's kinda racey, but perhaps better to let the caller decide?
return _magic().from_file(paths)
Json = Dict[str, Any]
from typing import TypeVar, Callable, Generic
_C = TypeVar('_C')
_R = TypeVar('_R')
# https://stackoverflow.com/a/5192374/706389
# NOTE: it was added to stdlib in 3.9 and then deprecated in 3.11
# seems that the suggested solution is to use custom decorator?
class classproperty(Generic[_R]):
def __init__(self, f: Callable[[_C], _R]) -> None:
def __init__(self, f: Callable[..., _R]) -> None:
self.f = f
def __get__(self, obj: None, cls: _C) -> _R:
def __get__(self, obj, cls) -> _R:
return self.f(cls)
def test_classproperty() -> None:
from .compat import assert_type
class C:
@classproperty
def prop(cls) -> str:
return 'hello'
res = C.prop
assert_type(res, str)
assert res == 'hello'
# hmm, this doesn't really work with mypy well..
# https://github.com/python/mypy/issues/6244
# class staticproperty(Generic[_R]):
@ -341,310 +146,117 @@ class classproperty(Generic[_R]):
# def __get__(self) -> _R:
# return self.f()
# TODO deprecate in favor of datetime_aware
tzdatetime = datetime
# TODO doctests?
def isoparse(s: str) -> tzdatetime:
"""
Parses timestamps formatted like 2020-05-01T10:32:02.925961Z
"""
# TODO could use dateutil? but it's quite slow as far as I remember..
# TODO support non-utc.. somehow?
assert s.endswith('Z'), s
s = s[:-1] + '+00:00'
return datetime.fromisoformat(s)
# legacy import -- we should use compat directly instead
from .compat import Literal
import re
# https://stackoverflow.com/a/295466/706389
def get_valid_filename(s: str) -> str:
s = str(s).strip().replace(' ', '_')
return re.sub(r'(?u)[^-\w.]', '', s)
from typing import Generic, Sized, Callable
# TODO deprecate and suggest to use one from my.core directly? not sure
from .utils.itertools import unique_everseen # noqa: F401
### legacy imports, keeping them here for backwards compatibility
## hiding behind TYPE_CHECKING so it works in runtime
## in principle, warnings.deprecated decorator should cooperate with mypy, but doesn't look like it works atm?
## perhaps it doesn't work when it's used from typing_extensions
# X = TypeVar('X')
def _warn_iterator(it, f: Any=None):
emitted = False
for i in it:
yield i
emitted = True
if not emitted:
warnings.warn(f"Function {f} didn't emit any data, make sure your config paths are correct")
if not TYPE_CHECKING:
from .compat import deprecated
@deprecated('use my.core.compat.assert_never instead')
def assert_never(*args, **kwargs):
return compat.assert_never(*args, **kwargs)
# TODO ugh, so I want to express something like:
# X = TypeVar('X')
# C = TypeVar('C', bound=Iterable[X])
# _warn_iterable(it: C) -> C
# but apparently I can't??? ugh.
# https://github.com/python/typing/issues/548
# I guess for now overloads are fine...
@deprecated('use my.core.compat.fromisoformat instead')
def isoparse(*args, **kwargs):
return compat.fromisoformat(*args, **kwargs)
from typing import overload
X = TypeVar('X')
@overload
def _warn_iterable(it: List[X] , f: Any=None) -> List[X] : ...
@overload
def _warn_iterable(it: Iterable[X], f: Any=None) -> Iterable[X]: ...
def _warn_iterable(it, f=None):
if isinstance(it, Sized):
sz = len(it)
if sz == 0:
warnings.warn(f"Function {f} returned empty container, make sure your config paths are correct")
return it
else:
return _warn_iterator(it, f=f)
@deprecated('use more_itertools.one instead')
def the(*args, **kwargs):
import more_itertools
return more_itertools.one(*args, **kwargs)
# ok, this seems to work...
# https://github.com/python/mypy/issues/1927#issue-167100413
FL = TypeVar('FL', bound=Callable[..., List])
FI = TypeVar('FI', bound=Callable[..., Iterable])
@deprecated('use functools.cached_property instead')
def cproperty(*args, **kwargs):
import functools
@overload
def warn_if_empty(f: FL) -> FL: ...
@overload
def warn_if_empty(f: FI) -> FI: ...
return functools.cached_property(*args, **kwargs)
def warn_if_empty(f):
from functools import wraps
@wraps(f)
def wrapped(*args, **kwargs):
res = f(*args, **kwargs)
return _warn_iterable(res, f=f)
return wrapped # type: ignore
# global state that turns on/off quick stats
# can use the 'quick_stats' contextmanager
# to enable/disable this in cli so that module 'stats'
# functions don't have to implement custom 'quick' logic
QUICK_STATS = False
# in case the user wants to use the stats functions/quick option
# elsewhere -- can use this decorator instead of editing
# the global state directly
@contextmanager
def quick_stats():
global QUICK_STATS
prev = QUICK_STATS
try:
QUICK_STATS = True
yield
finally:
QUICK_STATS = prev
C = TypeVar('C')
Stats = Dict[str, Any]
StatsFun = Callable[[], Stats]
# todo not sure about return type...
def stat(func: Union[Callable[[], Iterable[C]], Iterable[C]], quick: bool=False) -> Stats:
if callable(func):
fr = func()
fname = func.__name__
else:
# meh. means it's just a list.. not sure how to generate a name then
fr = func
fname = f'unnamed_{id(fr)}'
tname = type(fr).__name__
if tname == 'DataFrame':
# dynamic, because pandas is an optional dependency..
df = cast(Any, fr) # todo ugh, not sure how to annotate properly
res = dict(
dtypes=df.dtypes.to_dict(),
rows=len(df),
)
else:
res = _stat_iterable(fr, quick=quick)
return {
fname: res,
}
def _stat_iterable(it: Iterable[C], quick: bool=False) -> Any:
from more_itertools import ilen, take, first
# todo not sure if there is something in more_itertools to compute this?
total = 0
errors = 0
last = None
def funcit():
nonlocal errors, last, total
for x in it:
total += 1
if isinstance(x, Exception):
errors += 1
else:
last = x
yield x
eit = funcit()
count: Any
if quick or QUICK_STATS:
initial = take(100, eit)
count = len(initial)
if first(eit, None) is not None: # todo can actually be none...
# haven't exhausted
count = f'{count}+'
else:
count = ilen(eit)
res = {
'count': count,
}
if total == 0:
# not sure but I guess a good balance? wouldn't want to throw early here?
res['warning'] = 'THE ITERABLE RETURNED NO DATA'
if errors > 0:
res['errors'] = errors
if last is not None:
dt = guess_datetime(last)
if dt is not None:
res['last'] = dt
@deprecated('use more_itertools.bucket instead')
def group_by_key(l, key):
res = {}
for i in l:
kk = key(i)
lst = res.get(kk, [])
lst.append(i)
res[kk] = lst
return res
@deprecated('use my.core.utils.itertools.make_dict instead')
def make_dict(*args, **kwargs):
from .utils import itertools as UI
def test_stat_iterable() -> None:
from datetime import datetime, timedelta
from typing import NamedTuple
return UI.make_dict(*args, **kwargs)
dd = datetime.utcfromtimestamp(123)
day = timedelta(days=3)
@deprecated('use my.core.utils.itertools.listify instead')
def listify(*args, **kwargs):
from .utils import itertools as UI
X = NamedTuple('X', [('x', int), ('d', datetime)])
return UI.listify(*args, **kwargs)
def it():
yield RuntimeError('oops!')
for i in range(2):
yield X(x=i, d=dd + day * i)
yield RuntimeError('bad!')
for i in range(3):
yield X(x=i * 10, d=dd + day * (i * 10))
yield X(x=123, d=dd + day * 50)
@deprecated('use my.core.warn_if_empty instead')
def warn_if_empty(*args, **kwargs):
from .utils import itertools as UI
res = _stat_iterable(it())
assert res['count'] == 1 + 2 + 1 + 3 + 1
assert res['errors'] == 1 + 1
assert res['last'] == dd + day * 50
return UI.listify(*args, **kwargs)
@deprecated('use my.core.stat instead')
def stat(*args, **kwargs):
from . import stats
# experimental, not sure about it..
def guess_datetime(x: Any) -> Optional[datetime]:
# todo hmm implement withoutexception..
try:
d = asdict(x)
except: # noqa: E722 bare except
return None
for k, v in d.items():
if isinstance(v, datetime):
return v
return None
return stats.stat(*args, **kwargs)
def test_guess_datetime() -> None:
from datetime import datetime
from dataclasses import dataclass
from typing import NamedTuple
@deprecated('use my.core.make_logger instead')
def LazyLogger(*args, **kwargs):
from . import logging
dd = isoparse('2021-02-01T12:34:56Z')
return logging.LazyLogger(*args, **kwargs)
# ugh.. https://github.com/python/mypy/issues/7281
A = NamedTuple('A', [('x', int)])
B = NamedTuple('B', [('x', int), ('created', datetime)])
@deprecated('use my.core.types.asdict instead')
def asdict(*args, **kwargs):
from . import types
assert guess_datetime(A(x=4)) is None
assert guess_datetime(B(x=4, created=dd)) == dd
return types.asdict(*args, **kwargs)
@dataclass
class C:
a: datetime
x: int
assert guess_datetime(C(a=dd, x=435)) == dd
# TODO not sure what to return when multiple datetime fields?
# TODO test @property?
# todo wrap these in deprecated decorator as well?
# TODO hmm how to deprecate these in runtime?
# tricky cause they are actually classes/types
from typing import Literal # noqa: F401
from .cachew import mcachew # noqa: F401
def is_namedtuple(thing: Any) -> bool:
# basic check to see if this is namedtuple-like
_asdict = getattr(thing, '_asdict', None)
return (_asdict is not None) and callable(_asdict)
# this is kinda internal, should just use my.core.logging.setup_logger if necessary
from .logging import setup_logger
from .stats import Stats
from .types import (
Json,
datetime_aware,
datetime_naive,
)
def asdict(thing: Any) -> Json:
# todo primitive?
# todo exception?
if isinstance(thing, dict):
return thing
import dataclasses as D
if D.is_dataclass(thing):
return D.asdict(thing)
if is_namedtuple(thing):
return thing._asdict()
raise TypeError(f'Could not convert object {thing} to dict')
# for now just serves documentation purposes... but one day might make it statically verifiable where possible?
# TODO e.g. maybe use opaque mypy alias?
datetime_naive = datetime
datetime_aware = datetime
def assert_subpackage(name: str) -> None:
# can lead to some unexpected issues if you 'import cachew' while being in the my/core directory.. so let's protect against it
# NOTE: if we use overlay, name can be smth like my.origg.my.core.cachew ...
assert name == '__main__' or 'my.core' in name, f'Expected module __name__ ({name}) to be __main__ or start with my.core'
# https://stackoverflow.com/a/10436851/706389
from concurrent.futures import Future, Executor
class DummyExecutor(Executor):
def __init__(self, max_workers: Optional[int]=1) -> None:
self._shutdown = False
self._max_workers = max_workers
# TODO: once support for 3.7 is dropped,
# can make 'fn' a positional only parameter,
# which fixes the mypy error this throws without the type: ignore
def submit(self, fn, *args, **kwargs) -> Future: # type: ignore[override]
if self._shutdown:
raise RuntimeError('cannot schedule new futures after shutdown')
f: Future[Any] = Future()
try:
result = fn(*args, **kwargs)
except KeyboardInterrupt:
raise
except BaseException as e:
f.set_exception(e)
tzdatetime = datetime_aware
else:
f.set_result(result)
from .compat import Never
return f
def shutdown(self, wait: bool=True) -> None: # type: ignore[override]
self._shutdown = True
# see https://hakibenita.com/python-mypy-exhaustive-checking#exhaustiveness-checking
def assert_never(value: NoReturn) -> NoReturn:
assert False, f'Unhandled value: {value} ({type(value).__name__})'
# legacy deprecated import
from .compat import cached_property as cproperty
# make these invalid during type check while working in runtime
Stats = Never
tzdatetime = Never
Json = Never
datetime_naive = Never
datetime_aware = Never
###

View file

@ -1,127 +1,139 @@
'''
Some backwards compatibility stuff/deprecation helpers
Contains backwards compatibility helpers for different python versions.
If something is relevant to HPI itself, please put it in .hpi_compat instead
'''
from __future__ import annotations
import sys
from types import ModuleType
from . import warnings
from .common import LazyLogger
logger = LazyLogger('my.core.compat')
def pre_pip_dal_handler(
name: str,
e: ModuleNotFoundError,
cfg,
requires=[],
) -> ModuleType:
'''
https://github.com/karlicoss/HPI/issues/79
'''
if e.name != name:
# the module itself was imported, so the problem is with some dependencies
raise e
try:
dal = _get_dal(cfg, name)
warnings.high(f'''
Specifying modules' dependencies in the config or in my/config/repos is deprecated!
Please install {' '.join(requires)} as PIP packages (see the corresponding README instructions).
'''.strip(), stacklevel=2)
except ModuleNotFoundError:
dal = None
if dal is None:
# probably means there was nothing in the old config in the first place
# so we should raise the original exception
raise e
return dal
def _get_dal(cfg, module_name: str):
mpath = getattr(cfg, module_name, None)
if mpath is not None:
from .common import import_dir
return import_dir(mpath, '.dal')
else:
from importlib import import_module
return import_module(f'my.config.repos.{module_name}.dal')
import os
windows = os.name == 'nt'
import sqlite3
def sqlite_backup(*, source: sqlite3.Connection, dest: sqlite3.Connection, **kwargs) -> None:
if sys.version_info[:2] >= (3, 7):
source.backup(dest, **kwargs)
else:
# https://stackoverflow.com/a/10856450/706389
import io
tempfile = io.StringIO()
for line in source.iterdump():
tempfile.write('%s\n' % line)
tempfile.seek(0)
dest.cursor().executescript(tempfile.read())
dest.commit()
# can remove after python3.9
def removeprefix(text: str, prefix: str) -> str:
if text.startswith(prefix):
return text[len(prefix):]
return text
# can remove after python3.8
if sys.version_info[:2] >= (3, 8):
from functools import cached_property
else:
from typing import TypeVar, Callable
Cl = TypeVar('Cl')
R = TypeVar('R')
def cached_property(f: Callable[[Cl], R]) -> R:
import functools
return property(functools.lru_cache(maxsize=1)(f)) # type: ignore
del Cl
del R
from typing import TYPE_CHECKING
if sys.version_info[:2] >= (3, 8):
from typing import Literal
if sys.version_info[:2] >= (3, 13):
from warnings import deprecated
else:
if TYPE_CHECKING:
from typing_extensions import Literal
else:
# erm.. I guess as long as it's not crashing, whatever...
class _Literal:
def __getitem__(self, args):
pass
Literal = _Literal()
from typing_extensions import deprecated
if sys.version_info[:2] >= (3, 8):
from typing import Protocol
else:
if TYPE_CHECKING:
from typing_extensions import Protocol # type: ignore[misc]
else:
# todo could also use NamedTuple?
Protocol = object
# keeping just for backwards compatibility, used to have compat implementation for 3.6
if not TYPE_CHECKING:
import sqlite3
@deprecated('use .backup method on sqlite3.Connection directly instead')
def sqlite_backup(*, source: sqlite3.Connection, dest: sqlite3.Connection, **kwargs) -> None:
# TODO warn here?
source.backup(dest, **kwargs)
# keeping for runtime backwards compatibility (added in 3.9)
@deprecated('use .removeprefix method on string directly instead')
def removeprefix(text: str, prefix: str) -> str:
return text.removeprefix(prefix)
@deprecated('use .removesuffix method on string directly instead')
def removesuffix(text: str, suffix: str) -> str:
return text.removesuffix(suffix)
##
## used to have compat function before 3.8 for these, keeping for runtime back compatibility
from functools import cached_property
from typing import Literal, Protocol, TypedDict
##
if sys.version_info[:2] >= (3, 8):
from typing import TypedDict
if sys.version_info[:2] >= (3, 10):
from typing import ParamSpec
else:
if TYPE_CHECKING:
from typing_extensions import TypedDict # type: ignore[misc]
from typing_extensions import ParamSpec
# bisect_left doesn't have a 'key' parameter (which we use)
# till python3.10
if sys.version_info[:2] <= (3, 9):
from typing import Any, Callable, List, Optional, TypeVar # noqa: UP035
X = TypeVar('X')
# copied from python src
# fmt: off
def bisect_left(a: list[Any], x: Any, lo: int=0, hi: int | None=None, *, key: Callable[..., Any] | None=None) -> int:
if lo < 0:
raise ValueError('lo must be non-negative')
if hi is None:
hi = len(a)
# Note, the comparison uses "<" to match the
# __lt__() logic in list.sort() and in heapq.
if key is None:
while lo < hi:
mid = (lo + hi) // 2
if a[mid] < x:
lo = mid + 1
else:
from typing import Dict
TypedDict = Dict
hi = mid
else:
while lo < hi:
mid = (lo + hi) // 2
if key(a[mid]) < x:
lo = mid + 1
else:
hi = mid
return lo
# fmt: on
else:
from bisect import bisect_left
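For illustration, the key= usage that this backport enables on older pythons (bisect_left here is the one defined/imported above; the data is a placeholder):

from datetime import datetime

events = [
    (datetime(2020, 1, 1), 'a'),
    (datetime(2021, 1, 1), 'b'),
    (datetime(2022, 1, 1), 'c'),
]
# first index whose datetime is not before the probe, comparing by the first tuple element only
idx = bisect_left(events, datetime(2020, 6, 1), key=lambda e: e[0])
assert idx == 1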
from datetime import datetime
if sys.version_info[:2] >= (3, 11):
fromisoformat = datetime.fromisoformat
else:
# fromisoformat didn't support Z as "utc" before 3.11
# https://docs.python.org/3/library/datetime.html#datetime.datetime.fromisoformat
def fromisoformat(date_string: str) -> datetime:
if date_string.endswith('Z'):
date_string = date_string[:-1] + '+00:00'
return datetime.fromisoformat(date_string)
def test_fromisoformat() -> None:
from datetime import timezone
# fmt: off
# feedbin has this format
assert fromisoformat('2020-05-01T10:32:02.925961Z') == datetime(
2020, 5, 1, 10, 32, 2, 925961, timezone.utc,
)
# polar has this format
assert fromisoformat('2018-11-28T22:04:01.304Z') == datetime(
2018, 11, 28, 22, 4, 1, 304000, timezone.utc,
)
# stackexchange, runnerup has this format
assert fromisoformat('2020-11-30T00:53:12Z') == datetime(
2020, 11, 30, 0, 53, 12, 0, timezone.utc,
)
# fmt: on
# arbtt has this format (sometimes less/more than 6 digits in milliseconds)
# TODO doesn't work atm, not sure if really should be supported...
# maybe should have flags for weird formats?
# assert isoparse('2017-07-18T18:59:38.21731Z') == datetime(
# 2017, 7, 18, 18, 59, 38, 217310, timezone.utc,
# )
if sys.version_info[:2] >= (3, 10):
from types import NoneType
from typing import TypeAlias
else:
NoneType = type(None)
from typing_extensions import TypeAlias
if sys.version_info[:2] >= (3, 11):
from typing import Never, assert_never, assert_type
else:
from typing_extensions import Never, assert_never, assert_type

View file

@ -1,27 +1,33 @@
'''
Bindings for the 'core' HPI configuration
'''
import re
from typing import Sequence, Optional
from . import warnings, PathIsh, Path
from __future__ import annotations
import re
from collections.abc import Sequence
from dataclasses import dataclass
from pathlib import Path
from . import warnings
try:
from my.config import core as user_config # type: ignore[attr-defined]
except Exception as e:
try:
from my.config import common as user_config # type: ignore[attr-defined, assignment, misc]
from my.config import common as user_config # type: ignore[attr-defined]
warnings.high("'common' config section is deprecated. Please rename it to 'core'.")
except Exception as e2:
# make it defensive, because it's pretty commonly used and would be annoying if it breaks hpi doctor etc.
# this way it'll at least use the defaults
# todo actually not sure if needs a warning? Perhaps it's okay without it, because the defaults are reasonable enough
user_config = object # type: ignore[assignment, misc]
user_config = object
_HPI_CACHE_DIR_DEFAULT = ''
from dataclasses import dataclass
@dataclass
class Config(user_config):
'''
@ -32,7 +38,7 @@ class Config(user_config):
cache_dir = '/your/custom/cache/path'
'''
cache_dir: Optional[PathIsh] = _HPI_CACHE_DIR_DEFAULT
cache_dir: Path | str | None = _HPI_CACHE_DIR_DEFAULT
'''
Base directory for cachew.
- if None , means cache is disabled
@ -42,7 +48,7 @@ class Config(user_config):
NOTE: you shouldn't use this attribute in HPI modules directly, use Config.get_cache_dir()/cachew.cache_dir() instead
'''
tmp_dir: Optional[PathIsh] = None
tmp_dir: Path | str | None = None
'''
Path to a temporary directory.
This can be used temporarily while extracting zipfiles etc...
@ -50,34 +56,36 @@ class Config(user_config):
- otherwise , use the specified directory as the base temporary directory
'''
enabled_modules : Optional[Sequence[str]] = None
enabled_modules: Sequence[str] | None = None
'''
list of regexes/globs
- None means 'rely on disabled_modules'
'''
disabled_modules: Optional[Sequence[str]] = None
disabled_modules: Sequence[str] | None = None
'''
list of regexes/globs
- None means 'rely on enabled_modules'
'''
def get_cache_dir(self) -> Optional[Path]:
def get_cache_dir(self) -> Path | None:
cdir = self.cache_dir
if cdir is None:
return None
if cdir == _HPI_CACHE_DIR_DEFAULT:
from .cachew import _appdirs_cache_dir
return _appdirs_cache_dir()
else:
return Path(cdir).expanduser()
def get_tmp_dir(self) -> Path:
tdir: Optional[PathIsh] = self.tmp_dir
tdir: Path | str | None = self.tmp_dir
tpath: Path
# use tempfile if unset
if tdir is None:
import tempfile
tpath = Path(tempfile.gettempdir()) / 'HPI'
else:
tpath = Path(tdir)
@ -85,10 +93,10 @@ class Config(user_config):
tpath.mkdir(parents=True, exist_ok=True)
return tpath
def _is_module_active(self, module: str) -> Optional[bool]:
def _is_module_active(self, module: str) -> bool | None:
# None means the config doesn't specify anything
# todo might be nice to return the 'reason' too? e.g. which option has matched
def matches(specs: Sequence[str]) -> Optional[str]:
def matches(specs: Sequence[str]) -> str | None:
for spec in specs:
# not sure because . (the package separator) matches anything, but I guess unlikely to clash
if re.match(spec, module):
@ -114,12 +122,15 @@ class Config(user_config):
from .cfg import make_config
config = make_config(Config)
### tests start
from typing import Iterator
from collections.abc import Iterator
from contextlib import contextmanager as ctx
@ctx
def _reset_config() -> Iterator[Config]:
# todo maybe have this decorator for the whole of my.config?
@ -146,7 +157,7 @@ def test_active_modules() -> None:
cc.disabled_modules = ['my.body.*']
assert cc._is_module_active('my.whatever' ) is True
assert cc._is_module_active('my.core' ) is None
assert not cc._is_module_active('my.body.exercise') is True
assert cc._is_module_active('my.body.exercise') is False
with reset() as cc:
# if both are set, enable all
@ -158,4 +169,5 @@ def test_active_modules() -> None:
assert cc._is_module_active("my.body.exercise") is True
assert len(record_warnings) == 1
### tests end
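For illustration, a hedged sketch of the corresponding user config section (paths and patterns are placeholders):

# in your my.config
class core:
    cache_dir = '~/.cache/my'         # None would disable cachew caching entirely
    tmp_dir = None                    # None -> use the system temporary directory
    disabled_modules = ['my.body.*']  # regexes; matching modules are skipped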

View file

@ -1,32 +1,5 @@
from __future__ import annotations
from .common import assert_subpackage; assert_subpackage(__name__)
from . import warnings
from .common import PathIsh
from .compat import Protocol
from .sqlite import sqlite_connect_immutable
warnings.high(f"{__name__} is deprecated, please use dataset directly if you need or switch to my.core.sqlite")
## sadly dataset doesn't have any type definitions
from typing import Iterable, Iterator, Dict, Optional, Any
from contextlib import AbstractContextManager
# NOTE: may not be true in general, but will be in the vast majority of cases
row_type_T = Dict[str, Any]
class TableT(Iterable, Protocol):
def find(self, *, order_by: Optional[str]=None) -> Iterator[row_type_T]: ...
class DatabaseT(AbstractContextManager['DatabaseT'], Protocol):
def __getitem__(self, table: str) -> TableT: ...
##
# TODO wonder if also need to open without WAL.. test this on read-only directory/db file
def connect_readonly(db: PathIsh) -> DatabaseT:
import dataset # type: ignore
# see https://github.com/pudo/dataset/issues/136#issuecomment-128693122
# todo not sure if mode=ro has any benefit, but it doesn't work on read-only filesystems
# maybe it should autodetect readonly filesystems and apply this? not sure
creator = lambda: sqlite_connect_immutable(db)
return dataset.connect('sqlite:///', engine_kwargs={'creator': creator})
from ._deprecated.dataset import *

179
my/core/denylist.py Normal file
View file

@ -0,0 +1,179 @@
"""
A helper module for defining denylists for sources programmatically
(in layman's terms, this lets you remove output you don't want from a module)
For docs, see doc/DENYLIST.md
"""
from __future__ import annotations
import functools
import json
import sys
from collections import defaultdict
from collections.abc import Iterator, Mapping
from pathlib import Path
from typing import Any, TypeVar
import click
from more_itertools import seekable
from .serialize import dumps
from .warnings import medium
T = TypeVar("T")
DenyMap = Mapping[str, set[Any]]
def _default_key_func(obj: T) -> str:
return str(obj)
class DenyList:
def __init__(self, denylist_file: Path | str) -> None:
self.file = Path(denylist_file).expanduser().absolute()
self._deny_raw_list: list[dict[str, Any]] = []
self._deny_map: DenyMap = defaultdict(set)
# deny cli, user can override these
self.fzf_path = None
self._fzf_options = ()
self._deny_cli_key_func = None
def _load(self) -> None:
if not self.file.exists():
medium(f"denylist file {self.file} does not exist")
return
deny_map: DenyMap = defaultdict(set)
data: list[dict[str, Any]] = json.loads(self.file.read_text())
self._deny_raw_list = data
for ignore in data:
for k, v in ignore.items():
deny_map[k].add(v)
self._deny_map = deny_map
def load(self) -> DenyMap:
self._load()
return self._deny_map
def write(self) -> None:
if not self._deny_raw_list:
medium("no denylist data to write")
return
self.file.write_text(json.dumps(self._deny_raw_list))
@classmethod
def _is_json_primitive(cls, val: Any) -> bool:
return isinstance(val, (str, int, float, bool, type(None)))
@classmethod
def _stringify_value(cls, val: Any) -> Any:
# if it's a primitive, just return it
if cls._is_json_primitive(val):
return val
# otherwise, stringify-and-back so we can compare to
# json data loaded from the denylist file
return json.loads(dumps(val))
@classmethod
def _allow(cls, obj: T, deny_map: DenyMap) -> bool:
for deny_key, deny_set in deny_map.items():
# this should be done separately and not as part of the getattr
# because 'null'/None could actually be a value in the denylist,
# and the user may define behavior to filter that out
if not hasattr(obj, deny_key):
return False
val = cls._stringify_value(getattr(obj, deny_key))
# this object doesn't have the attribute in the denylist
if val in deny_set:
return False
# if we tried all the denylist keys and didn't return False,
# then this object is allowed
return True
def filter(
self,
itr: Iterator[T],
*,
invert: bool = False,
) -> Iterator[T]:
denyf = functools.partial(self._allow, deny_map=self.load())
if invert:
return filter(lambda x: not denyf(x), itr)
return filter(denyf, itr)
def deny(self, key: str, value: Any, *, write: bool = False) -> None:
'''
add a key/value pair to the denylist
'''
if not self._deny_raw_list:
self._load()
self._deny_raw({key: self._stringify_value(value)}, write=write)
def _deny_raw(self, data: dict[str, Any], *, write: bool = False) -> None:
self._deny_raw_list.append(data)
if write:
self.write()
def _prompt_keys(self, item: T) -> str:
import pprint
click.echo(pprint.pformat(item))
# TODO: extract keys from item by checking if its dataclass/NT etc.?
resp = click.prompt("Key to deny on").strip()
if not hasattr(item, resp):
click.echo(f"Could not find key '{resp}' on item", err=True)
return self._prompt_keys(item)
return resp
def _deny_cli_remember(
self,
items: Iterator[T],
mem: dict[str, T],
) -> Iterator[str]:
keyf = self._deny_cli_key_func or _default_key_func
# i.e., convert each item to a string, and map str -> item
for item in items:
key = keyf(item)
mem[key] = item
yield key
def deny_cli(self, itr: Iterator[T]) -> None:
try:
from pyfzf import FzfPrompt
except ImportError:
click.echo("pyfzf is required to use the denylist cli, run 'python3 -m pip install pyfzf_iter'", err=True)
sys.exit(1)
# wrap in seekable so we can use it multiple times
# progressively caches the items as we iterate over them
sit = seekable(itr)
prompt_continue = True
while prompt_continue:
# reset the iterator
sit.seek(0)
# so we can map the selected string from fzf back to the original objects
memory_map: dict[str, T] = {}
picker = FzfPrompt(executable_path=self.fzf_path, default_options="--no-multi")
picked_l = picker.prompt(
self._deny_cli_remember(itr, memory_map),
"--read0",
*self._fzf_options,
delimiter="\0",
)
assert isinstance(picked_l, list)
if picked_l:
picked: T = memory_map[picked_l[0]]
key = self._prompt_keys(picked)
self.deny(key, getattr(picked, key), write=True)
click.echo(f"Added {self._deny_raw_list[-1]} to denylist", err=True)
else:
click.echo("No item selected", err=True)
prompt_continue = click.confirm("Continue?")
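For illustration, a hedged sketch of wiring DenyList into a module's output (the record type, file path and data are placeholders):

from dataclasses import dataclass

from my.core.denylist import DenyList

@dataclass
class Message:  # placeholder record type
    sender: str
    text: str

deny = DenyList('/tmp/hypothetical-denylist.json')

def _raw_messages():
    yield Message(sender='friend@example.com', text='hi')
    yield Message(sender='spammer@example.com', text='buy now')

def messages():
    # entries matching any key/value pair recorded in the denylist file are filtered out
    yield from deny.filter(_raw_messages())

# record a new denylist entry and persist it straight away
deny.deny(key='sender', value='spammer@example.com', write=True)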

View file

@ -10,17 +10,20 @@ This potentially allows it to be:
It should be free of external modules, importlib, exec, etc. etc.
'''
from __future__ import annotations
REQUIRES = 'REQUIRES'
NOT_HPI_MODULE_VAR = '__NOT_HPI_MODULE__'
###
import ast
import os
from typing import Optional, Sequence, List, NamedTuple, Iterable, cast, Any
from pathlib import Path
import re
import logging
import os
import re
from collections.abc import Iterable, Sequence
from pathlib import Path
from typing import Any, NamedTuple, Optional, cast
'''
None means that requirements weren't defined (different from empty requirements)
@ -30,11 +33,11 @@ Requires = Optional[Sequence[str]]
class HPIModule(NamedTuple):
name: str
skip_reason: Optional[str]
doc: Optional[str] = None
file: Optional[Path] = None
skip_reason: str | None
doc: str | None = None
file: Path | None = None
requires: Requires = None
legacy: Optional[str] = None # contains reason/deprecation warning
legacy: str | None = None # contains reason/deprecation warning
def ignored(m: str) -> bool:
@ -119,7 +122,7 @@ def _extract_requirements(a: ast.Module) -> Requires:
elif isinstance(c, ast.Str):
deps.append(c.s)
else:
raise RuntimeError(f"Expecting string contants only in {REQUIRES} declaration")
raise RuntimeError(f"Expecting string constants only in {REQUIRES} declaration")
return tuple(deps)
return None
@ -144,7 +147,7 @@ def all_modules() -> Iterable[HPIModule]:
def _iter_my_roots() -> Iterable[Path]:
import my # doesn't import any code, because of namespace package
paths: List[str] = list(my.__path__) # type: ignore[attr-defined]
paths: list[str] = list(my.__path__)
if len(paths) == 0:
# should probably never happen?, if this code is running, it was imported
# because something was added to __path__ to match this name
@ -242,7 +245,7 @@ def test_pure() -> None:
src = Path(__file__).read_text()
# 'import my' is allowed, but
# don't allow any other HPI modules
assert re.findall('import ' + r'my\.\S+', src, re.M) == []
assert re.findall('import ' + r'my\.\S+', src, re.MULTILINE) == []
assert 'from ' + 'my' not in src

View file

@ -3,11 +3,22 @@ Various error handling helpers
See https://beepb00p.xyz/mypy-error-handling.html#kiss for more detail
"""
from __future__ import annotations
import traceback
from collections.abc import Iterable, Iterator
from datetime import datetime
from itertools import tee
from typing import Union, TypeVar, Iterable, List, Tuple, Type, Optional, Callable, Any, cast
from .compat import Literal
from typing import (
Any,
Callable,
Literal,
TypeVar,
Union,
cast,
)
from .types import Json
T = TypeVar('T')
E = TypeVar('E', bound=Exception) # TODO make covariant?
@ -18,7 +29,8 @@ Res = ResT[T, Exception]
ErrorPolicy = Literal["yield", "raise", "drop"]
def notnone(x: Optional[T]) -> T:
def notnone(x: T | None) -> T:
assert x is not None
return x
@ -26,16 +38,49 @@ def notnone(x: Optional[T]) -> T:
def unwrap(res: Res[T]) -> T:
if isinstance(res, Exception):
raise res
else:
return res
def drop_exceptions(itr: Iterator[Res[T]]) -> Iterator[T]:
"""Return non-errors from the iterable"""
for o in itr:
if isinstance(o, Exception):
continue
yield o
def raise_exceptions(itr: Iterable[Res[T]]) -> Iterator[T]:
"""Raise errors from the iterable, stops the select function"""
for o in itr:
if isinstance(o, Exception):
raise o
yield o
def warn_exceptions(itr: Iterable[Res[T]], warn_func: Callable[[Exception], None] | None = None) -> Iterator[T]:
# if not provided, use the 'warnings' module
if warn_func is None:
from my.core.warnings import medium
def _warn_func(e: Exception) -> None:
# TODO: print traceback? but user could always --raise-exceptions as well
medium(str(e))
warn_func = _warn_func
for o in itr:
if isinstance(o, Exception):
warn_func(o)
continue
yield o
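For orientation, a tiny usage sketch of these helpers (the toy provider below is made up for illustration, not part of this diff):
# illustrative only -- a toy provider yielding a mix of values and errors
from collections.abc import Iterator

from my.core.error import Res, drop_exceptions, warn_exceptions

def _toy_provider() -> Iterator[Res[int]]:
    yield 1
    yield RuntimeError('flaky input')
    yield 2

assert list(drop_exceptions(_toy_provider())) == [1, 2]  # errors are silently skipped
assert list(warn_exceptions(_toy_provider())) == [1, 2]  # same, but emits a warning for each error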
def echain(ex: E, cause: Exception) -> E:
ex.__cause__ = cause
return ex
def split_errors(l: Iterable[ResT[T, E]], ET: Type[E]) -> Tuple[Iterable[T], Iterable[E]]:
def split_errors(l: Iterable[ResT[T, E]], ET: type[E]) -> tuple[Iterable[T], Iterable[E]]:
# TODO would be nice to have ET=Exception default? but it causes some mypy complaints?
vit, eit = tee(l)
# TODO ugh, not sure if I can reconcile type checking and runtime and convince mypy that ET and E are the same type?
@ -53,7 +98,9 @@ def split_errors(l: Iterable[ResT[T, E]], ET: Type[E]) -> Tuple[Iterable[T], Ite
K = TypeVar('K')
def sort_res_by(items: Iterable[Res[T]], key: Callable[[Any], K]) -> List[Res[T]]:
def sort_res_by(items: Iterable[Res[T]], key: Callable[[Any], K]) -> list[Res[T]]:
"""
Sort a sequence potentially interleaved with errors/entries on which the key can't be computed.
The general idea is: the error sticks to the non-error entry that follows it
@ -61,7 +108,7 @@ def sort_res_by(items: Iterable[Res[T]], key: Callable[[Any], K]) -> List[Res[T]
group = []
groups = []
for i in items:
k: Optional[K]
k: K | None
try:
k = key(i)
except Exception: # error while computing key? dunno, might be nice to handle...
@ -71,8 +118,8 @@ def sort_res_by(items: Iterable[Res[T]], key: Callable[[Any], K]) -> List[Res[T]
groups.append((k, group))
group = []
results: List[Res[T]] = []
for v, grp in sorted(groups, key=lambda p: p[0]): # type: ignore[return-value, arg-type] # TODO SupportsLessThan??
results: list[Res[T]] = []
for _v, grp in sorted(groups, key=lambda p: p[0]): # type: ignore[return-value, arg-type] # TODO SupportsLessThan??
results.extend(grp)
results.extend(group) # handle last group (it will always be errors only)
@ -94,7 +141,7 @@ def test_sort_res_by() -> None:
1,
Exc('last'),
]
results = sort_res_by(ress, lambda x: int(x)) # type: ignore
results = sort_res_by(ress, lambda x: int(x))
assert results == [
1,
'bad',
@ -106,32 +153,32 @@ def test_sort_res_by() -> None:
Exc('last'),
]
results2 = sort_res_by(ress + [0], lambda x: int(x)) # type: ignore
results2 = sort_res_by([*ress, 0], lambda x: int(x))
assert results2 == [Exc('last'), 0] + results[:-1]
assert sort_res_by(['caba', 'a', 'aba', 'daba'], key=lambda x: len(x)) == ['a', 'aba', 'caba', 'daba']
assert sort_res_by([], key=lambda x: x) == [] # type: ignore
assert sort_res_by([], key=lambda x: x) == []
# helpers to associate timestamps with the errors (so something meaningful could be displayed on the plots, for example)
# todo document it under 'patterns' somewhere...
# todo proper typevar?
from datetime import datetime
def set_error_datetime(e: Exception, dt: Optional[datetime]) -> None:
def set_error_datetime(e: Exception, dt: datetime | None) -> None:
if dt is None:
return
e.args = e.args + (dt,)
e.args = (*e.args, dt)
# todo not sure if should return new exception?
def attach_dt(e: Exception, *, dt: Optional[datetime]) -> Exception:
def attach_dt(e: Exception, *, dt: datetime | None) -> Exception:
set_error_datetime(e, dt)
return e
# todo it might be problematic because might mess with timezones (when it's converted to string, it's converted to a shift)
def extract_error_datetime(e: Exception) -> Optional[datetime]:
def extract_error_datetime(e: Exception) -> datetime | None:
import re
from datetime import datetime
for x in reversed(e.args):
if isinstance(x, datetime):
return x
@ -146,8 +193,6 @@ def extract_error_datetime(e: Exception) -> Optional[datetime]:
return None
import traceback
from .common import Json
def error_to_json(e: Exception) -> Json:
estr = ''.join(traceback.format_exception(Exception, e, e.__traceback__))
return {'error': estr}
@ -155,7 +200,13 @@ def error_to_json(e: Exception) -> Json:
MODULE_SETUP_URL = 'https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#private-configuration-myconfig'
def warn_my_config_import_error(err: Union[ImportError, AttributeError], help_url: Optional[str] = None) -> bool:
def warn_my_config_import_error(
err: ImportError | AttributeError,
*,
help_url: str | None = None,
module_name: str | None = None,
) -> bool:
"""
If the user tried to import something from my.config but it failed,
possibly due to missing the config block in my.config?
@ -163,10 +214,12 @@ def warn_my_config_import_error(err: Union[ImportError, AttributeError], help_ur
Returns True if it matched a possible config error
"""
import re
import click
if help_url is None:
help_url = MODULE_SETUP_URL
if type(err) == ImportError:
if type(err) is ImportError:
if err.name != 'my.config':
return False
# parse name that user attempted to import
@ -178,17 +231,31 @@ You may be missing the '{section_name}' section from your config.
See {help_url}\
""", fg='yellow', err=True)
return True
elif type(err) == AttributeError:
elif type(err) is AttributeError:
# test if user had a nested config block missing
# https://github.com/karlicoss/HPI/issues/223
if hasattr(err, 'obj') and hasattr(err, "name"):
config_obj = cast(object, getattr(err, 'obj')) # the object that caused the attribute error
# e.g. active_browser for my.browser
nested_block_name = err.name # type: ignore[attr-defined]
if config_obj.__module__ == 'my.config':
click.secho(f"""You're likely missing the nested config block for '{getattr(config_obj, '__name__', str(config_obj))}.{nested_block_name}'.
nested_block_name = err.name
errmsg = f"""You're likely missing the nested config block for '{getattr(config_obj, '__name__', str(config_obj))}.{nested_block_name}'.
See {help_url} or check the corresponding module.py file for an example\
""", fg='yellow', err=True)
"""
if config_obj.__module__ == 'my.config':
click.secho(errmsg, fg='yellow', err=True)
return True
if module_name is not None and nested_block_name == module_name.split('.')[-1]:
# this tries to cover cases like these
# user config:
# class location:
# class via_ip:
# accuracy = 10_000
# then when we import it, we do something like
# from my.config import location
# user_config = location.via_ip
# so if location is present, but via_ip is not, we get
# AttributeError: type object 'location' has no attribute 'via_ip'
click.secho(errmsg, fg='yellow', err=True)
return True
else:
click.echo(f"Unexpected error... {err}", err=True)
@ -196,7 +263,8 @@ See {help_url} or check the corresponding module.py file for an example\
def test_datetime_errors() -> None:
import pytz
import pytz # noqa: I001
dt_notz = datetime.now()
dt_tz = datetime.now(tz=pytz.timezone('Europe/Amsterdam'))
for dt in [dt_tz, dt_notz]:

my/core/experimental.py (new file, 66 lines)
View file

@ -0,0 +1,66 @@
from __future__ import annotations
import sys
import types
from typing import Any
# The idea behind this one is to support accessing "overlaid/shadowed" modules from namespace packages
# See usage examples here:
# - https://github.com/karlicoss/hpi-personal-overlay/blob/master/src/my/util/hpi_heartbeat.py
# - https://github.com/karlicoss/hpi-personal-overlay/blob/master/src/my/twitter/all.py
# Suppose you want to use my.twitter.talon, which isn't in the default all.py
# You could just copy all.py to your personal overlay, but that would mean duplicating
# all the code and possible upstream changes.
# Alternatively, you could import the "original" my.twitter.all module from "overlay" my.twitter.all
# _ORIG = import_original_module(__name__, __file__)
# this would magically take care of package import path etc,
# and should import the "original" my.twitter.all as _ORIG
# After that you can call its methods, extend etc.
def import_original_module(
module_name: str,
file: str,
*,
star: bool = False,
globals: dict[str, Any] | None = None,
) -> types.ModuleType:
module_to_restore = sys.modules[module_name]
# NOTE: we really want to hack the actual package of the module
# rather than just top level my.
# since that would be a bit less disruptive
module_pkg = module_to_restore.__package__
assert module_pkg is not None
parent = sys.modules[module_pkg]
my_path = parent.__path__._path # type: ignore[attr-defined]
my_path_orig = list(my_path)
def fixup_path() -> None:
for i, p in enumerate(my_path_orig):
starts = file.startswith(p)
if i == 0:
# not sure about this.. but I guess it'll always be 0th element?
assert starts, (my_path_orig, file)
if starts:
my_path.remove(p)
# should remove exactly one item
assert len(my_path) + 1 == len(my_path_orig), (my_path_orig, file)
try:
fixup_path()
try:
del sys.modules[module_name]
# NOTE: we're using __import__ instead of importlib.import_module
# since it's closer to the actual normal import (e.g. imports subpackages etc properly )
# fromlist=[None] forces it to return rightmost child
# (otherwise would just return 'my' package)
res = __import__(module_name, fromlist=[None]) # type: ignore[list-item]
if star:
assert globals is not None
globals.update({k: v for k, v in vars(res).items() if not k.startswith('_')})
return res
finally:
sys.modules[module_name] = module_to_restore
finally:
my_path[:] = my_path_orig
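For illustration, a hypothetical personal overlay my/twitter/all.py might use this helper roughly like so (the extra 'talon' source is just a placeholder, not part of this diff):
# sketch of an overlay all.py -- names/paths are assumptions, adapt to your own overlay
from my.core.experimental import import_original_module

_ORIG = import_original_module(__name__, __file__, star=True, globals=globals())

# the original my.twitter.all is now available as _ORIG (and its public names were
# star-imported into globals()), so you can extend it, e.g.:
# def tweets():
#     yield from _ORIG.tweets()
#     yield from my.twitter.talon.tweets()  # hypothetical extra source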

View file

@ -1,27 +1,29 @@
from .common import assert_subpackage; assert_subpackage(__name__)
from __future__ import annotations
import dataclasses as dcl
from .internal import assert_subpackage
assert_subpackage(__name__)
import dataclasses
import inspect
from typing import TypeVar, Type, Any
from typing import Any, Generic, TypeVar
D = TypeVar('D')
def _freeze_dataclass(Orig: Type[D]):
ofields = [(f.name, f.type, f) for f in dcl.fields(Orig)]
def _freeze_dataclass(Orig: type[D]):
ofields = [(f.name, f.type, f) for f in dataclasses.fields(Orig)] # type: ignore[arg-type] # see https://github.com/python/typing_extensions/issues/115
# extract properties along with their types
props = list(inspect.getmembers(Orig, lambda o: isinstance(o, property)))
pfields = [(name, inspect.signature(getattr(prop, 'fget')).return_annotation) for name, prop in props]
# FIXME not sure about name?
# NOTE: sadly passing bases=[Orig] won't work, python won't let us override properties with fields
RRR = dcl.make_dataclass('RRR', fields=[*ofields, *pfields])
RRR = dataclasses.make_dataclass('RRR', fields=[*ofields, *pfields])
# todo maybe even declare as slots?
return props, RRR
# todo need some decorator thingie?
from typing import Generic
class Freezer(Generic[D]):
'''
Some magic which converts dataclass properties into fields.
@ -29,13 +31,13 @@ class Freezer(Generic[D]):
For now only supports dataclasses.
'''
def __init__(self, Orig: Type[D]) -> None:
def __init__(self, Orig: type[D]) -> None:
self.Orig = Orig
self.props, self.Frozen = _freeze_dataclass(Orig)
def freeze(self, value: D) -> D:
pvalues = {name: getattr(value, name) for name, _ in self.props}
return self.Frozen(**dcl.asdict(value), **pvalues)
return self.Frozen(**dataclasses.asdict(value), **pvalues) # type: ignore[call-overload] # see https://github.com/python/typing_extensions/issues/115
### tests
@ -43,7 +45,7 @@ class Freezer(Generic[D]):
# this needs to be defined here to prevent a mypy bug
# see https://github.com/python/mypy/issues/7281
@dcl.dataclass
@dataclasses.dataclass
class _A:
x: Any
@ -58,8 +60,10 @@ class _A:
def test_freezer() -> None:
val = _A(x=dict(an_int=123, an_any=[1, 2, 3]))
val = _A(x={
'an_int': 123,
'an_any': [1, 2, 3],
})
af = Freezer(_A)
fval = af.freeze(val)
@ -67,6 +71,7 @@ def test_freezer() -> None:
assert fd['typed'] == 123
assert fd['untyped'] == [1, 2, 3]
###
# TODO shit. what to do with exceptions?

my/core/hpi_compat.py (new file, 260 lines)
View file

@ -0,0 +1,260 @@
"""
Contains various backwards compatibility/deprecation helpers relevant to HPI itself.
(as opposed to .compat module which implements compatibility between python versions)
"""
from __future__ import annotations
import inspect
import os
import re
from collections.abc import Iterator, Sequence
from types import ModuleType
from typing import TypeVar
from . import warnings
def handle_legacy_import(
parent_module_name: str,
legacy_submodule_name: str,
parent_module_path: list[str],
) -> bool:
###
# this is to trick mypy into treating this as a proper namespace package
# should only be used for backwards compatibility on packages that are converted into the namespace & all.py pattern
# - https://www.python.org/dev/peps/pep-0382/#namespace-packages-today
# - https://github.com/karlicoss/hpi_namespace_experiment
# - discussion here https://memex.zulipchat.com/#narrow/stream/279601-hpi/topic/extending.20HPI/near/269946944
from pkgutil import extend_path
parent_module_path[:] = extend_path(parent_module_path, parent_module_name)
# 'this' source tree ends up first in the pythonpath when we extend_path()
# so we need to move 'this' source tree towards the end to make sure we prioritize overlays
parent_module_path[:] = parent_module_path[1:] + parent_module_path[:1]
###
# allow stuff like 'import my.module.submodule' and such
imported_as_parent = False
# allow stuff like 'from my.module import submodule'
importing_submodule = False
# some hacky traceback to inspect the current stack
# to see if the user is using the old style of importing
for f in inspect.stack():
# seems that when a submodule is imported, at some point it'll call some internal import machinery
# with 'parent' set to the parent module
# if parent module is imported first (i.e. in case of deprecated usage), it won't be the case
args = inspect.getargvalues(f.frame)
if args.locals.get('parent') == parent_module_name:
imported_as_parent = True
# this we can only detect from the code I guess
line = '\n'.join(f.code_context or [])
if re.match(rf'from\s+{parent_module_name}\s+import\s+{legacy_submodule_name}', line):
importing_submodule = True
# click sets '_HPI_COMPLETE' env var when it's doing autocompletion
# otherwise, the warning will be printed every time you try to tab complete
autocompleting_module_cli = "_HPI_COMPLETE" in os.environ
is_legacy_import = not (imported_as_parent or importing_submodule)
if is_legacy_import and not autocompleting_module_cli:
warnings.high(
f'''\
importing {parent_module_name} is DEPRECATED! \
Instead, import from {parent_module_name}.{legacy_submodule_name} or {parent_module_name}.all \
See https://github.com/karlicoss/HPI/blob/master/doc/MODULE_DESIGN.org#allpy for more info.
'''
)
return is_legacy_import
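A hedged sketch of how a package __init__.py typically wires this up (module and submodule names below are placeholders):
# hypothetical my/somedata/__init__.py converted to the namespace & all.py pattern
from my.core.hpi_compat import handle_legacy_import

is_legacy_import = handle_legacy_import(
    parent_module_name=__name__,
    legacy_submodule_name='provider',  # placeholder for the old implementation module
    parent_module_path=__path__,
)
if is_legacy_import:
    # keep old-style 'import my.somedata' working by re-exporting the legacy submodule
    from my.somedata.provider import *  # noqa: F403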
def pre_pip_dal_handler(
name: str,
e: ModuleNotFoundError,
cfg,
requires: Sequence[str] = (),
) -> ModuleType:
'''
https://github.com/karlicoss/HPI/issues/79
'''
if e.name != name:
# the module itself was imported, so the problem is with some dependencies
raise e
try:
dal = _get_dal(cfg, name)
warnings.high(
f'''
Specifying modules' dependencies in the config or in my/config/repos is deprecated!
Please install {' '.join(requires)} as PIP packages (see the corresponding README instructions).
'''.strip(),
stacklevel=2,
)
except ModuleNotFoundError:
dal = None
if dal is None:
# probably means there was nothing in the old config in the first place
# so we should raise the original exception
raise e
return dal
def _get_dal(cfg, module_name: str):
mpath = getattr(cfg, module_name, None)
if mpath is not None:
from .utils.imports import import_dir
return import_dir(mpath, '.dal')
else:
from importlib import import_module
return import_module(f'my.config.repos.{module_name}.dal')
V = TypeVar('V')
# named to be kinda consistent with more_itertools, e.g. more_itertools.always_iterable
class always_supports_sequence(Iterator[V]):
"""
Helper to make migration from Sequence/List to Iterable/Iterator type backwards compatible in runtime
"""
def __init__(self, it: Iterator[V]) -> None:
self._it = it
self._list: list[V] | None = None
self._lit: Iterator[V] | None = None
def __iter__(self) -> Iterator[V]: # noqa: PYI034
if self._list is not None:
self._lit = iter(self._list)
return self
def __next__(self) -> V:
if self._list is not None:
assert self._lit is not None
delegate = self._lit
else:
delegate = self._it
return next(delegate)
def __getattr__(self, name):
return getattr(self._it, name)
@property
def _aslist(self) -> list[V]:
if self._list is None:
qualname = getattr(self._it, '__qualname__', '<no qualname>') # defensive just in case
warnings.medium(f'Using {qualname} as list is deprecated. Migrate to iterative processing or call list() explicitly.')
self._list = list(self._it)
# this is necessary for list constructor to work correctly
# since it's __iter__ first, then tries to compute length and then starts iterating...
self._lit = iter(self._list)
return self._list
def __len__(self) -> int:
return len(self._aslist)
def __getitem__(self, i: int) -> V:
return self._aslist[i]
def test_always_supports_sequence_list_constructor() -> None:
exhausted = 0
def it() -> Iterator[str]:
nonlocal exhausted
yield from ['a', 'b', 'c']
exhausted += 1
sit = always_supports_sequence(it())
# list constructor is a bit special... it's trying to compute length if it's available to optimize memory allocation
# so, what's happening in this case is
# - sit.__iter__ is called
# - sit.__len__ is called
# - sit.__next__ is called
res = list(sit)
assert res == ['a', 'b', 'c']
assert exhausted == 1
res = list(sit)
assert res == ['a', 'b', 'c']
assert exhausted == 1 # this will iterate over 'cached' list now, so original generator is only exhausted once
def test_always_supports_sequence_indexing() -> None:
exhausted = 0
def it() -> Iterator[str]:
nonlocal exhausted
yield from ['a', 'b', 'c']
exhausted += 1
sit = always_supports_sequence(it())
assert len(sit) == 3
assert exhausted == 1
assert sit[2] == 'c'
assert sit[1] == 'b'
assert sit[0] == 'a'
assert exhausted == 1
# a few tests to make sure list-like operations are working..
assert list(sit) == ['a', 'b', 'c']
assert [x for x in sit] == ['a', 'b', 'c'] # noqa: C416
assert list(sit) == ['a', 'b', 'c']
assert [x for x in sit] == ['a', 'b', 'c'] # noqa: C416
assert exhausted == 1
def test_always_supports_sequence_next() -> None:
exhausted = 0
def it() -> Iterator[str]:
nonlocal exhausted
yield from ['a', 'b', 'c']
exhausted += 1
sit = always_supports_sequence(it())
x = next(sit)
assert x == 'a'
assert exhausted == 0
x = next(sit)
assert x == 'b'
assert exhausted == 0
def test_always_supports_sequence_iter() -> None:
exhausted = 0
def it() -> Iterator[str]:
nonlocal exhausted
yield from ['a', 'b', 'c']
exhausted += 1
sit = always_supports_sequence(it())
for x in sit:
assert x == 'a'
break
x = next(sit)
assert x == 'b'
assert exhausted == 0
x = next(sit)
assert x == 'c'
assert exhausted == 0
for _ in sit:
raise RuntimeError # shouldn't trigger, just exhaust the iterator
assert exhausted == 1

View file

@ -1,14 +1,22 @@
'''
TODO doesn't really belong to 'core' morally, but can think of moving out later
'''
from .common import assert_subpackage; assert_subpackage(__name__)
from typing import Iterable, Any, Optional, Dict
from __future__ import annotations
from .common import LazyLogger, asdict, Json
from .internal import assert_subpackage
assert_subpackage(__name__)
logger = LazyLogger(__name__)
from collections.abc import Iterable
from typing import Any
import click
from .logging import make_logger
from .types import Json, asdict
logger = make_logger(__name__)
class config:
@ -27,6 +35,7 @@ def fill(it: Iterable[Any], *, measurement: str, reset: bool=RESET_DEFAULT, dt_c
db = config.db
from influxdb import InfluxDBClient # type: ignore
client = InfluxDBClient()
# todo maybe create if not exists?
# client.create_database(db)
@ -37,7 +46,7 @@ def fill(it: Iterable[Any], *, measurement: str, reset: bool=RESET_DEFAULT, dt_c
client.delete_series(database=db, measurement=measurement)
# TODO need to take schema here...
cache: Dict[str, bool] = {}
cache: dict[str, bool] = {}
def good(f, v) -> bool:
c = cache.get(f)
@ -56,7 +65,7 @@ def fill(it: Iterable[Any], *, measurement: str, reset: bool=RESET_DEFAULT, dt_c
def dit() -> Iterable[Json]:
for i in it:
d = asdict(i)
tags: Optional[Json] = None
tags: Json | None = None
tags_ = d.get('tags') # meh... handle in a more robust manner
if tags_ is not None and isinstance(tags_, dict): # FIXME meh.
del d['tags']
@ -69,18 +78,19 @@ def fill(it: Iterable[Any], *, measurement: str, reset: bool=RESET_DEFAULT, dt_c
fields = filter_dict(d)
yield dict(
measurement=measurement,
yield {
'measurement': measurement,
# TODO maybe good idea to tag with database file/name? to inspect inconsistencies etc..
# hmm, so tags are autoindexed and might be faster?
# not sure what's the big difference though
# "fields are data and tags are metadata"
tags=tags,
time=dt,
fields=fields,
)
'tags': tags,
'time': dt,
'fields': fields,
}
from more_itertools import chunked
# "The optimal batch size is 5000 lines of line protocol."
# some chunking is def necessary, otherwise it fails
inserted = 0
@ -94,7 +104,7 @@ def fill(it: Iterable[Any], *, measurement: str, reset: bool=RESET_DEFAULT, dt_c
# todo "Specify timestamp precision when writing to InfluxDB."?
def magic_fill(it, *, name: Optional[str]=None, reset: bool=RESET_DEFAULT) -> None:
def magic_fill(it, *, name: str | None = None, reset: bool = RESET_DEFAULT) -> None:
if name is None:
assert callable(it) # generators have no name/module
name = f'{it.__module__}:{it.__name__}'
@ -104,7 +114,9 @@ def magic_fill(it, *, name: Optional[str]=None, reset: bool=RESET_DEFAULT) -> No
it = it()
from itertools import tee
from more_itertools import first, one
it, x = tee(it)
f = first(x, default=None)
if f is None:
@ -114,17 +126,17 @@ def magic_fill(it, *, name: Optional[str]=None, reset: bool=RESET_DEFAULT) -> No
# TODO can we reuse pandas code or something?
#
from .pandas import _as_columns
schema = _as_columns(type(f))
from datetime import datetime
dtex = RuntimeError(f'expected single datetime field. schema: {schema}')
dtf = one((f for f, t in schema.items() if t == datetime), too_short=dtex, too_long=dtex)
fill(it, measurement=name, reset=reset, dt_col=dtf)
import click
@click.group()
def main() -> None:
pass
@ -133,8 +145,9 @@ def main() -> None:
@main.command(name='populate', short_help='populate influxdb')
@click.option('--reset', is_flag=True, help='Reset Influx measurements before inserting', show_default=True)
@click.argument('FUNCTION_NAME', type=str, required=True)
def populate(function_name: str, reset: bool) -> None:
def populate(*, function_name: str, reset: bool) -> None:
from .__main__ import _locate_functions_or_prompt
[provider] = list(_locate_functions_or_prompt([function_name]))
# todo could have a non-interactive version which populates from all data sources for the provider?
magic_fill(provider, reset=reset)
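As a rough usage sketch of the Python API (the provider module below is hypothetical):
# illustrative only -- 'my.hypothetical.module' is a placeholder data provider
from my.core import influxdb
from my.hypothetical.module import entries  # iterator of dataclasses/NamedTuples with a datetime field

# pass the function itself: magic_fill derives the measurement name and datetime column from its schema
influxdb.magic_fill(entries)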

View file

@ -1,7 +1,8 @@
'''
A hook to insert user's config directory into Python's search path.
Note that this file is imported only if we don't have custom user config (under my.config namespace) in PYTHONPATH
Ideally that would be in __init__.py (so it's executed without having to import explicityly)
Ideally that would be in __init__.py (so it's executed without having to import explicitly)
But, with namespace packages, we can't have __init__.py in the parent subpackage
(see http://python-notes.curiousefficiency.org/en/latest/python_concepts/import_traps.html#the-init-py-trap)
@ -15,15 +16,17 @@ Please let me know if you are aware of a better way of dealing with this!
def setup_config() -> None:
import sys
import warnings
from pathlib import Path
from .preinit import get_mycfg_dir
mycfg_dir = get_mycfg_dir()
if not mycfg_dir.exists():
warnings.warn(f"""
'my.config' package isn't found! (expected at '{mycfg_dir}'). This is likely to result in issues.
See https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#setting-up-the-modules for more info.
""".strip())
""".strip(), stacklevel=1)
return
mpath = str(mycfg_dir)
@ -41,11 +44,29 @@ See https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#setting-up-the-mo
except ImportError as ex:
# just in case... who knows what crazy setup users have
import logging
logging.exception(ex)
warnings.warn(f"""
Importing 'my.config' failed! (error: {ex}). This is likely to result in issues.
See https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#setting-up-the-modules for more info.
""")
""", stacklevel=1)
else:
# defensive just in case -- __file__ may not be present if there is some dynamic magic involved
used_config_file = getattr(my.config, '__file__', None)
if used_config_file is not None:
used_config_path = Path(used_config_file)
try:
# will crash if it's imported from other dir?
used_config_path.relative_to(mycfg_dir)
except ValueError:
# TODO maybe implement a strict mode where these warnings will be errors?
warnings.warn(
f"""
Expected my.config to be located at {mycfg_dir}, but instead its path is {used_config_path}.
This will likely cause issues down the line -- double check {mycfg_dir} structure.
See https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#setting-up-the-modules for more info.
""", stacklevel=1
)
setup_config()

my/core/internal.py (new file, 9 lines)
View file

@ -0,0 +1,9 @@
"""
Utils specific to hpi core, shouldn't really be used by HPI modules
"""
def assert_subpackage(name: str) -> None:
# can lead to some unexpected issues if you 'import cachew' while being in the my/core directory... so let's protect against it
# NOTE: if we use overlay, name can be smth like my.origg.my.core.cachew ...
assert name == '__main__' or 'my.core' in name, f'Expected module __name__ ({name}) to be __main__ or start with my.core'

View file

@ -1,224 +1,17 @@
"""
Various helpers for compression
"""
from __future__ import annotations
from .internal import assert_subpackage
from datetime import datetime
import pathlib
from pathlib import Path
import sys
from typing import Union, IO, Sequence, Any, Iterator
import io
assert_subpackage(__name__)
PathIsh = Union[Path, str]
from . import warnings
# do this later -- for now need to transition modules to avoid using kompress directly (e.g. ZipPath)
# warnings.high('my.core.kompress is deprecated, please use "kompress" library directly. See https://github.com/karlicoss/kompress')
class Ext:
xz = '.xz'
zip = '.zip'
lz4 = '.lz4'
zstd = '.zstd'
targz = '.tar.gz'
def is_compressed(p: Path) -> bool:
# todo kinda lame way for now.. use mime ideally?
# should cooperate with kompress.kopen?
return any(p.name.endswith(ext) for ext in {Ext.xz, Ext.zip, Ext.lz4, Ext.zstd, Ext.targz})
def _zstd_open(path: Path, *args, **kwargs) -> IO[str]:
import zstandard as zstd # type: ignore
fh = path.open('rb')
dctx = zstd.ZstdDecompressor()
reader = dctx.stream_reader(fh)
return io.TextIOWrapper(reader, **kwargs) # meh
# TODO returns protocol that we can call 'read' against?
# TODO use the 'dependent type' trick?
def kopen(path: PathIsh, *args, mode: str='rt', **kwargs) -> IO[str]:
# TODO handle mode in *rags?
encoding = kwargs.get('encoding', 'utf8')
kwargs['encoding'] = encoding
pp = Path(path)
name = pp.name
if name.endswith(Ext.xz):
import lzma
r = lzma.open(pp, mode, *args, **kwargs)
# should only happen for binary mode?
# file:///usr/share/doc/python3/html/library/lzma.html?highlight=lzma#lzma.open
assert not isinstance(r, lzma.LZMAFile), r
return r
elif name.endswith(Ext.zip):
# eh. this behaviour is a bit dodgy...
from zipfile import ZipFile
zfile = ZipFile(pp)
[subpath] = args # meh?
## oh god... https://stackoverflow.com/a/5639960/706389
ifile = zfile.open(subpath, mode='r')
ifile.readable = lambda: True # type: ignore
ifile.writable = lambda: False # type: ignore
ifile.seekable = lambda: False # type: ignore
ifile.read1 = ifile.read # type: ignore
# TODO pass all kwargs here??
# todo 'expected "BinaryIO"'??
return io.TextIOWrapper(ifile, encoding=encoding) # type: ignore[arg-type]
elif name.endswith(Ext.lz4):
import lz4.frame # type: ignore
return lz4.frame.open(str(pp), mode, *args, **kwargs)
elif name.endswith(Ext.zstd):
return _zstd_open(pp, mode, *args, **kwargs)
elif name.endswith(Ext.targz):
import tarfile
# FIXME pass mode?
tf = tarfile.open(pp)
# TODO pass encoding?
x = tf.extractfile(*args); assert x is not None
return x # type: ignore[return-value]
else:
return pp.open(mode, *args, **kwargs)
import typing
import os
if typing.TYPE_CHECKING:
# otherwise mypy can't figure out that BasePath is a type alias..
BasePath = pathlib.Path
else:
BasePath = pathlib.WindowsPath if os.name == 'nt' else pathlib.PosixPath
class CPath(BasePath):
"""
Hacky way to support compressed files.
If you can think of a better way to do this, please let me know! https://github.com/karlicoss/HPI/issues/20
Ugh. So, can't override Path because of some _flavour thing.
Path only has _accessor and _closed slots, so can't directly set .open method
_accessor.open has to return file descriptor, doesn't work for compressed stuff.
"""
def open(self, *args, **kwargs):
# TODO assert read only?
return kopen(str(self))
open = kopen # TODO deprecate
# meh
# TODO ideally switch to ZipPath or smth similar?
# nothing else supports subpath properly anyway
def kexists(path: PathIsh, subpath: str) -> bool:
try:
kopen(path, subpath)
return True
except Exception:
return False
import zipfile
if sys.version_info[:2] >= (3, 8):
# meh... zipfile.Path is not available on 3.7
zipfile_Path = zipfile.Path
from kompress import *
except ModuleNotFoundError as e:
if e.name == 'kompress':
warnings.high('Please install kompress (pip3 install kompress). Falling onto vendorized kompress for now.')
from ._deprecated.kompress import * # type: ignore[assignment]
else:
if typing.TYPE_CHECKING:
zipfile_Path = Any
else:
zipfile_Path = object
class ZipPath(zipfile_Path):
# NOTE: is_dir/is_file might not behave as expected, the base class checks it only based on the slash in path
# seems that root/at are not exposed in the docs, so might be an implementation detail
root: zipfile.ZipFile
at: str
@property
def filepath(self) -> Path:
res = self.root.filename
assert res is not None # make mypy happy
return Path(res)
@property
def subpath(self) -> Path:
return Path(self.at)
def absolute(self) -> ZipPath:
return ZipPath(self.filepath.absolute(), self.at)
def exists(self) -> bool:
if self.at == '':
# special case, the base class returns False in this case for some reason
return self.filepath.exists()
return super().exists() or self._as_dir().exists()
def _as_dir(self) -> zipfile_Path:
# note: seems that zip always uses forward slash, regardless OS?
return zipfile_Path(self.root, self.at + '/')
def rglob(self, glob: str) -> Sequence[ZipPath]:
# note: not 100% sure about the correctness, but seem fine?
# Path.match() matches from the right, so need to
rpaths = [p for p in self.root.namelist() if p.startswith(self.at)]
rpaths = [p for p in rpaths if Path(p).match(glob)]
return [ZipPath(self.root, p) for p in rpaths]
def relative_to(self, other: ZipPath) -> Path:
assert self.filepath == other.filepath, (self.filepath, other.filepath)
return self.subpath.relative_to(other.subpath)
@property
def parts(self) -> Sequence[str]:
# messy, but might be ok..
return self.filepath.parts + self.subpath.parts
def __truediv__(self, key) -> ZipPath:
# need to implement it so the return type is not zipfile.Path
tmp = zipfile_Path(self.root) / self.at / key
return ZipPath(self.root, tmp.at) # type: ignore[attr-defined]
def iterdir(self) -> Iterator[ZipPath]:
for s in self._as_dir().iterdir():
yield ZipPath(s.root, s.at) # type: ignore[attr-defined]
@property
def stem(self) -> str:
return self.subpath.stem
@property # type: ignore[misc]
def __class__(self):
return Path
def __eq__(self, other) -> bool:
# hmm, super class doesn't seem to treat as equals unless they are the same object
if not isinstance(other, ZipPath):
return False
return (self.filepath, self.subpath) == (other.filepath, other.subpath)
def __hash__(self) -> int:
return hash((self.filepath, self.subpath))
def stat(self) -> os.stat_result:
# NOTE: zip datetimes have no notion of time zone, usually they just keep local time?
# see https://en.wikipedia.org/wiki/ZIP_(file_format)#Structure
dt = datetime(*self.root.getinfo(self.at).date_time)
ts = int(dt.timestamp())
params = dict(
st_mode=0,
st_ino=0,
st_dev=0,
st_nlink=1,
st_uid=1000,
st_gid=1000,
st_size=0, # todo compute it properly?
st_atime=ts,
st_mtime=ts,
st_ctime=ts,
)
return os.stat_result(tuple(params.values()))
raise e

View file

@ -5,21 +5,25 @@ This can potentially allow both for safer defensive parsing, and let you know if
TODO perhaps need to get some inspiration from linear logic to decide on a nice API...
'''
from __future__ import annotations
from collections import OrderedDict
from typing import Any, List
from typing import Any
def ignore(w, *keys):
for k in keys:
w[k].ignore()
def zoom(w, *keys):
return [w[k].zoom() for k in keys]
# TODO need to support lists
class Zoomable:
def __init__(self, parent, *args, **kwargs) -> None:
super().__init__(*args, **kwargs) # type: ignore
super().__init__(*args, **kwargs)
self.parent = parent
# TODO not sure, maybe do it via del??
@ -40,7 +44,7 @@ class Zoomable:
assert self.parent is not None
self.parent._remove(self)
def zoom(self) -> 'Zoomable':
def zoom(self) -> Zoomable:
self.consume()
return self
@ -63,6 +67,7 @@ class Wdict(Zoomable, OrderedDict):
def this_consumed(self):
return len(self) == 0
# TODO specify mypy type for the index special method?
@ -77,6 +82,7 @@ class Wlist(Zoomable, list):
def this_consumed(self):
return len(self) == 0
class Wvalue(Zoomable):
def __init__(self, parent, value: Any) -> None:
super().__init__(parent)
@ -93,10 +99,9 @@ class Wvalue(Zoomable):
return 'WValue{' + repr(self.value) + '}'
from typing import Tuple
def _wrap(j, parent=None) -> Tuple[Zoomable, List[Zoomable]]:
def _wrap(j, parent=None) -> tuple[Zoomable, list[Zoomable]]:
res: Zoomable
cc: List[Zoomable]
cc: list[Zoomable]
if isinstance(j, dict):
res = Wdict(parent)
cc = [res]
@ -120,15 +125,17 @@ def _wrap(j, parent=None) -> Tuple[Zoomable, List[Zoomable]]:
raise RuntimeError(f'Unexpected type: {type(j)} {j}')
from collections.abc import Iterator
from contextlib import contextmanager
from typing import Iterator
class UnconsumedError(Exception):
pass
# TODO think about error policy later...
@contextmanager
def wrap(j, throw=True) -> Iterator[Zoomable]:
def wrap(j, *, throw=True) -> Iterator[Zoomable]:
w, children = _wrap(j)
yield w
@ -146,8 +153,11 @@ Expected {c} to be fully consumed by the parser.
from typing import cast
def test_unconsumed() -> None:
import pytest # type: ignore
import pytest
with pytest.raises(UnconsumedError):
with wrap({'a': 1234}) as w:
w = cast(Wdict, w)
@ -158,6 +168,7 @@ def test_unconsumed() -> None:
w = cast(Wdict, w)
d = w['c']['d'].zoom()
def test_consumed() -> None:
with wrap({'a': 1234}) as w:
w = cast(Wdict, w)
@ -168,6 +179,7 @@ def test_consumed() -> None:
c = w['c'].zoom()
d = c['d'].zoom()
def test_types() -> None:
# (string, number, object, array, boolean or null)
with wrap({'string': 'string', 'number': 3.14, 'boolean': True, 'null': None, 'list': [1, 2, 3]}) as w:
@ -179,6 +191,7 @@ def test_types() -> None:
for x in list(w['list'].zoom()): # TODO eh. how to avoid the extra list thing?
x.consume()
def test_consume_all() -> None:
with wrap({'aaa': {'bbb': {'hi': 123}}}) as w:
w = cast(Wdict, w)
@ -188,11 +201,9 @@ def test_consume_all() -> None:
def test_consume_few() -> None:
import pytest
pytest.skip('Will think about it later..')
with wrap({
'important': 123,
'unimportant': 'whatever'
}) as w:
with wrap({'important': 123, 'unimportant': 'whatever'}) as w:
w = cast(Wdict, w)
w['important'].zoom()
w.consume_all()
@ -200,7 +211,8 @@ def test_consume_few() -> None:
def test_zoom() -> None:
import pytest # type: ignore
import pytest
with wrap({'aaa': 'whatever'}) as w:
w = cast(Wdict, w)
with pytest.raises(KeyError):
@ -209,3 +221,34 @@ def test_zoom() -> None:
# TODO type check this...
# TODO feels like the whole thing kind of unnecessarily complex
# - cons:
# - in most cases this is not even needed? who cares if we miss a few attributes?
# - pro: on the other hand it could be interesting to know about new attributes in data,
# and without this kind of processing we wouldn't even know
# alternatives
# - manually process data
# e.g. use asserts, dict.pop and dict.values() methods to unpack things
# - pros:
# - very simple, since uses built in syntax
# - very performant, as fast as it gets
# - very flexible, easy to adjust behaviour
# - cons:
# - can forget to assert about extra entities etc, so error prone
# - if we do something like =assert j.pop('status') == 200, j=, by the time assert happens we already popped item -- makes error handling harder
# - a bit verbose.. so probably requires some helper functions though (could be much leaner than current konsume though)
# - if we assert, then terminates parsing too early, if we're defensive then inflates the code a lot with if statements
# - TODO perhaps combine warnings somehow or at least only emit once per module?
# - hmm actually tbh if we carefully go through everything and don't make copies, then only requires one assert at the very end?
# - TODO this is kinda useful? https://discuss.python.org/t/syntax-for-dictionnary-unpacking-to-variables/18718
# operator.itemgetter?
# - TODO can use match operator in python for this? quite nice actually! and allows for dynamic behaviour
# only from 3.10 tho, and gonna be tricky to do dynamic defensive behaviour with this
# - TODO in a sense, blenser already would hint if some meaningful fields aren't being processed? only if they are changing though
# - define a "schema" for data, then just recursively match data against the schema?
# possibly pydantic already does something like that? not sure about performance though
# pros:
# - much simpler to extend and understand what's going on
# cons:
# - more rigid, so it becomes tricky to do dynamic stuff (e.g. if schema actually changes)

View file

@ -1,55 +0,0 @@
# I think 'compat' should be for python-specific compat stuff, whereas this for HPI specific backwards compatibility
import inspect
import re
from typing import List
from my.core import warnings as W
def handle_legacy_import(
parent_module_name: str,
legacy_submodule_name: str,
parent_module_path: List[str],
) -> bool:
###
# this is to trick mypy into treating this as a proper namespace package
# should only be used for backwards compatibility on packages that are convernted into namespace & all.py pattern
# - https://www.python.org/dev/peps/pep-0382/#namespace-packages-today
# - https://github.com/karlicoss/hpi_namespace_experiment
# - discussion here https://memex.zulipchat.com/#narrow/stream/279601-hpi/topic/extending.20HPI/near/269946944
from pkgutil import extend_path
parent_module_path[:] = extend_path(parent_module_path, parent_module_name)
# 'this' source tree ends up first in the pythonpath when we extend_path()
# so we need to move 'this' source tree towards the end to make sure we prioritize overlays
parent_module_path[:] = parent_module_path[1:] + parent_module_path[:1]
###
# allow stuff like 'import my.module.submodule' and such
imported_as_parent = False
# allow stuff like 'from my.module import submodule'
importing_submodule = False
# some hacky traceback to inspect the current stack
# to see if the user is using the old style of importing
for f in inspect.stack():
# seems that when a submodule is imported, at some point it'll call some internal import machinery
# with 'parent' set to the parent module
# if parent module is imported first (i.e. in case of deprecated usage), it won't be the case
args = inspect.getargvalues(f.frame)
if args.locals.get('parent') == parent_module_name:
imported_as_parent = True
# this we can only detect from the code I guess
line = '\n'.join(f.code_context or [])
if re.match(rf'from\s+{parent_module_name}\s+import\s+{legacy_submodule_name}', line):
importing_submodule = True
is_legacy_import = not (imported_as_parent or importing_submodule)
if is_legacy_import:
W.high(f'''\
importing {parent_module_name} is DEPRECATED! \
Instead, import from {parent_module_name}.{legacy_submodule_name} or {parent_module_name}.all \
See https://github.com/karlicoss/HPI/blob/master/doc/MODULE_DESIGN.org#allpy for more info.
''')
return is_legacy_import

View file

@ -1,49 +1,61 @@
#!/usr/bin/env python3
'''
Default logger is a bit meh, see 'test'/run this file for a demo
TODO name 'klogging' to avoid possible conflict with default 'logging' module
TODO shit. too late already? maybe use fallback & deprecate
'''
from __future__ import annotations
import logging
import os
import sys
import warnings
from functools import lru_cache
from typing import TYPE_CHECKING, Union
def test() -> None:
from typing import Callable
import logging
import sys
M: Callable[[str], None] = lambda s: print(s, file=sys.stderr)
M(" Logging module's defaults are not great...'")
l = logging.getLogger('test_logger')
# todo why is mypy unhappy about these???
## prepare exception for later
try:
None.whatever # type: ignore[attr-defined] # noqa: B018
except Exception as e:
ex = e
##
M(" Logging module's defaults are not great:")
l = logging.getLogger('default_logger')
l.error("For example, this should be logged as error. But it's not even formatted properly, doesn't have logger name or level")
M(" The reason is that you need to remember to call basicConfig() first")
M("\n The reason is that you need to remember to call basicConfig() first. Let's do it now:")
logging.basicConfig()
l.error("OK, this is better. But the default format kinda sucks, I prefer having timestamps and the file/line number")
M("")
M(" With LazyLogger you get a reasonable logging format, colours and other neat things")
M("\n Also exception logging is kinda lame, doesn't print traceback by default unless you remember to pass exc_info:")
l.exception(ex) # type: ignore[possibly-undefined]
ll = LazyLogger('test') # No need for basicConfig!
M("\n\n With make_logger you get a reasonable logging format, colours (via colorlog library) and other neat things:")
ll = make_logger('test') # No need for basicConfig!
ll.info("default level is INFO")
ll.debug(".. so this shouldn't be displayed")
ll.debug("... so this shouldn't be displayed")
ll.warning("warnings are easy to spot!")
ll.exception(RuntimeError("exceptions as well"))
M("\n Exceptions print traceback by default now:")
ll.exception(ex)
M("\n You can (and should) use it via regular logging.getLogger after that, e.g. let's set logging level to DEBUG now")
logging.getLogger('test').setLevel(logging.DEBUG)
ll.debug("... now debug messages are also displayed")
import logging
from typing import Union, Optional
import os
DEFAULT_LEVEL = 'INFO'
FORMAT = '{start}[%(levelname)-7s %(asctime)s %(name)s %(filename)s:%(lineno)-4d]{end} %(message)s'
FORMAT_NOCOLOR = FORMAT.format(start='', end='')
Level = int
LevelIsh = Optional[Union[Level, str]]
LevelIsh = Union[Level, str, None]
def mklevel(level: LevelIsh) -> Level:
# todo put in some global file, like envvars.py
glevel = os.environ.get('HPI_LOGS', None)
if glevel is not None:
level = glevel
if level is None:
return logging.NOTSET
if isinstance(level, int):
@ -51,48 +63,204 @@ def mklevel(level: LevelIsh) -> Level:
return getattr(logging, level.upper())
FORMAT = '{start}[%(levelname)-7s %(asctime)s %(name)s %(filename)s:%(lineno)d]{end} %(message)s'
FORMAT_COLOR = FORMAT.format(start='%(color)s', end='%(end_color)s')
FORMAT_NOCOLOR = FORMAT.format(start='', end='')
DATEFMT = '%Y-%m-%d %H:%M:%S'
def get_collapse_level() -> Level | None:
# TODO not sure if should be specific to logger name?
cl = os.environ.get('LOGGING_COLLAPSE', None)
if cl is not None:
return mklevel(cl)
# legacy name, maybe deprecate?
cl = os.environ.get('COLLAPSE_DEBUG_LOGS', None)
if cl is not None:
return logging.DEBUG
return None
def setup_logger(logger: logging.Logger, level: LevelIsh) -> None:
lvl = mklevel(level)
try:
import logzero # type: ignore[import]
except ModuleNotFoundError:
import warnings
def get_env_level(name: str) -> Level | None:
PREFIX = 'LOGGING_LEVEL_' # e.g. LOGGING_LEVEL_my_hypothesis=debug
# shell doesn't allow using dots in var names without escaping, so also support underscore syntax
lvl = os.environ.get(PREFIX + name, None) or os.environ.get(PREFIX + name.replace('.', '_'), None)
if lvl is not None:
return mklevel(lvl)
# if LOGGING_LEVEL_HPI is set, use that. This should override anything the module may set as its default
# this is also set when the user passes the --debug flag in the CLI
#
# check after LOGGING_LEVEL_ prefix since that is more specific
if 'LOGGING_LEVEL_HPI' in os.environ:
return mklevel(os.environ['LOGGING_LEVEL_HPI'])
# legacy name, for backwards compatibility
if 'HPI_LOGS' in os.environ:
from my.core.warnings import medium
warnings.warn("You might want to install 'logzero' for nice colored logs!")
logger.setLevel(lvl)
h = logging.StreamHandler()
h.setLevel(lvl)
h.setFormatter(logging.Formatter(fmt=FORMAT_NOCOLOR, datefmt=DATEFMT))
logger.addHandler(h)
logger.propagate = False # ugh. otherwise it duplicates log messages? not sure about it..
medium('The HPI_LOGS environment variable is deprecated, use LOGGING_LEVEL_HPI instead')
return mklevel(os.environ['HPI_LOGS'])
return None
def setup_logger(logger: str | logging.Logger, *, level: LevelIsh = None) -> None:
"""
Wrapper to simplify logging setup.
"""
if isinstance(logger, str):
logger = logging.getLogger(logger)
if level is None:
level = DEFAULT_LEVEL
# env level always takes precedence
env_level = get_env_level(logger.name)
if env_level is not None:
lvl = env_level
else:
formatter = logzero.LogFormatter(
fmt=FORMAT_COLOR,
datefmt=DATEFMT,
)
logzero.setup_logger(logger.name, level=lvl, formatter=formatter)
lvl = mklevel(level)
if logger.level == logging.NOTSET:
# if it's already set, the user requested a different logging level, let's respect that
logger.setLevel(lvl)
_setup_handlers_and_formatters(name=logger.name)
class LazyLogger(logging.Logger):
def __new__(cls, name: str, level: LevelIsh = 'INFO') -> 'LazyLogger':
# cached since this should only be done once per logger instance
@lru_cache(None)
def _setup_handlers_and_formatters(name: str) -> None:
logger = logging.getLogger(name)
# this is called prior to all _log calls so makes sense to do it here?
def isEnabledFor_lazyinit(*args, logger=logger, orig=logger.isEnabledFor, **kwargs):
att = 'lazylogger_init_done'
if not getattr(logger, att, False): # init once, if necessary
setup_logger(logger, level=level)
setattr(logger, att, True)
return orig(*args, **kwargs)
logger.isEnabledFor = isEnabledFor_lazyinit # type: ignore[assignment]
return logger # type: ignore[return-value]
logger.addFilter(AddExceptionTraceback())
collapse_level = get_collapse_level()
if collapse_level is None or not sys.stderr.isatty():
handler = logging.StreamHandler()
else:
handler = CollapseLogsHandler(maxlevel=collapse_level)
# default level for handler is NOTSET, which will make it process all messages
# we rely on the logger to actually accept/reject log msgs
logger.addHandler(handler)
# this attribute is set to True by default, which causes log entries to be passed to root logger (e.g. if you call basicConfig beforehand)
# even if log entry is handled by this logger ... not sure what's the point of this behaviour??
logger.propagate = False
try:
# try colorlog first, so user gets nice colored logs
import colorlog
except ModuleNotFoundError:
warnings.warn("You might want to 'pip install colorlog' for nice colored logs", stacklevel=1)
formatter = logging.Formatter(FORMAT_NOCOLOR)
else:
# log_color/reset are specific to colorlog
FORMAT_COLOR = FORMAT.format(start='%(log_color)s', end='%(reset)s')
# colorlog should detect tty in principle, but doesn't handle everything for some reason
# see https://github.com/borntyping/python-colorlog/issues/71
if handler.stream.isatty():
formatter = colorlog.ColoredFormatter(FORMAT_COLOR)
else:
formatter = logging.Formatter(FORMAT_NOCOLOR)
handler.setFormatter(formatter)
# by default, logging.exception isn't logging traceback unless called inside of the exception handler
# which is a bit annoying since we have to pass exc_info explicitly
# also see https://stackoverflow.com/questions/75121925/why-doesnt-python-logging-exception-method-log-traceback-by-default
# todo also amend by post about defensive error handling?
class AddExceptionTraceback(logging.Filter):
def filter(self, record: logging.LogRecord) -> bool:
if record.levelname == 'ERROR':
exc = record.msg
if isinstance(exc, BaseException):
if record.exc_info is None or record.exc_info == (None, None, None):
exc_info = (type(exc), exc, exc.__traceback__)
record.exc_info = exc_info
return True
# todo also save full log in a file?
class CollapseLogsHandler(logging.StreamHandler):
'''
Collapses subsequent debug log lines and redraws on the same line.
Hopefully this gives both a sense of progress and doesn't clutter the terminal as much?
'''
last: bool = False
maxlevel: Level = logging.DEBUG # everything with less or equal level will be collapsed
def __init__(self, *args, maxlevel: Level, **kwargs) -> None:
super().__init__(*args, **kwargs)
self.maxlevel = maxlevel
def emit(self, record: logging.LogRecord) -> None:
try:
msg = self.format(record)
cur = record.levelno <= self.maxlevel and '\n' not in msg
if cur:
if self.last:
self.stream.write('\033[K' + '\r') # clear line + return carriage
else:
if self.last:
self.stream.write('\n') # clean up after the last line
self.last = cur
columns, _ = os.get_terminal_size(0)
# ugh. the columns thing is meh. dunno I guess ultimately need curses for that
# TODO also would be cool to have a terminal post-processor? kinda like tail but aware of logging keywords (INFO/DEBUG/etc)
self.stream.write(msg + ' ' * max(0, columns - len(msg)) + ('' if cur else '\n'))
self.flush()
except:
self.handleError(record)
def make_logger(name: str, *, level: LevelIsh = None) -> logging.Logger:
logger = logging.getLogger(name)
setup_logger(logger, level=level)
return logger
# ughh. hacky way to have a single enlighten instance per interpreter, so it can be shared between modules
# not sure about this. I guess this should definitely be behind some flag
# OK, when stdout is not a tty, enlighten doesn't log anything, good
def get_enlighten():
# TODO could add env variable to disable enlighten for a module?
from unittest.mock import (
Mock, # Mock to return stub so clients don't have to think about it
)
# for now hidden behind the flag since it's a little experimental
if os.environ.get('ENLIGHTEN_ENABLE', None) is None:
return Mock()
try:
import enlighten # type: ignore[import-untyped]
except ModuleNotFoundError:
warnings.warn("You might want to 'pip install enlighten' for a nice progress bar", stacklevel=1)
return Mock()
# dirty, but otherwise a bit unclear how to share enlighten manager between packages that call each other
instance = getattr(enlighten, 'INSTANCE', None)
if instance is not None:
return instance
instance = enlighten.get_manager()
setattr(enlighten, 'INSTANCE', instance)
return instance
if __name__ == '__main__':
test()
## legacy/deprecated methods for backwards compatibility
if not TYPE_CHECKING:
from .compat import deprecated
@deprecated('use make_logger instead')
def LazyLogger(*args, **kwargs):
return make_logger(*args, **kwargs)
@deprecated('use make_logger instead')
def logger(*args, **kwargs):
return make_logger(*args, **kwargs)
##
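To illustrate the environment-variable handling above, a minimal sketch (the module name is a placeholder):
import os

# pretend the user exported LOGGING_LEVEL_my_hypothesis=debug in their shell
os.environ['LOGGING_LEVEL_my_hypothesis'] = 'debug'

from my.core.logging import make_logger

logger = make_logger('my.hypothesis')  # the env var overrides the default INFO level
logger.debug('now visible')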

my/core/mime.py (new file, 37 lines)
View file

@ -0,0 +1,37 @@
"""
Utils for mime/filetype handling
"""
from __future__ import annotations
from .internal import assert_subpackage
assert_subpackage(__name__)
import functools
from pathlib import Path
@functools.lru_cache(1)
def _magic():
import magic # type: ignore
# TODO also has uncompress=True? could be useful
return magic.Magic(mime=True)
# TODO could reuse in pdf module?
import mimetypes # todo do I need init()?
# todo wtf? fastermime thinks its mime is application/json even if the extension is xz??
# whereas magic detects correctly: application/x-zstd and application/x-xz
def fastermime(path: Path | str) -> str:
paths = str(path)
# mimetypes is faster, so try it first
(mime, _) = mimetypes.guess_type(paths)
if mime is not None:
return mime
# magic is slower but handles more types
# TODO Result type?; it's kinda racey, but perhaps better to let the caller decide?
return _magic().from_file(paths)
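A quick usage sketch (paths are made up):
# mimetypes resolves common extensions without touching the file on disk
assert fastermime('/tmp/example.json') == 'application/json'
# for extensions mimetypes doesn't know, it falls back to content sniffing via libmagic,
# which requires the file to actually exist:
# fastermime('/tmp/data.xz')  # -> 'application/x-xz'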

View file

@ -1,10 +1,13 @@
"""
Various helpers for reading org-mode data
"""
from datetime import datetime
def parse_org_datetime(s: str) -> datetime:
s = s.strip('[]')
for fmt, cl in [
for fmt, _cls in [
("%Y-%m-%d %a %H:%M", datetime),
("%Y-%m-%d %H:%M" , datetime),
# todo not sure about these... fallback on 00:00?
@ -15,23 +18,29 @@ def parse_org_datetime(s: str) -> datetime:
return datetime.strptime(s, fmt)
except ValueError:
continue
else:
raise RuntimeError(f"Bad datetime string {s}")
# TODO I guess want to borrow inspiration from bs4? element type <-> tag; and similar logic for find_one, find_all
from collections.abc import Iterable
from typing import Callable, TypeVar
from orgparse import OrgNode
from typing import Iterable, TypeVar, Callable
V = TypeVar('V')
def collect(n: OrgNode, cfun: Callable[[OrgNode], Iterable[V]]) -> Iterable[V]:
yield from cfun(n)
for c in n.children:
yield from collect(c, cfun)
from more_itertools import one
from orgparse.extra import Table
def one_table(o: OrgNode) -> Table:
return one(collect(o, lambda n: (x for x in n.body_rich if isinstance(x, Table))))

View file

@ -1,32 +1,54 @@
'''
Various pandas helpers and convenience functions
'''
from __future__ import annotations
# todo not sure if belongs to 'core'. It's certainly 'more' core than actual modules, but still not essential
# NOTE: this file is meant to be importable without Pandas installed
from datetime import datetime
import dataclasses
from collections.abc import Iterable, Iterator
from datetime import datetime, timezone
from pprint import pformat
from typing import Optional, TYPE_CHECKING, Any, Iterable, Type, Dict
from . import warnings, Res
from .common import LazyLogger, Json, asdict
from typing import (
TYPE_CHECKING,
Any,
Callable,
Literal,
TypeVar,
)
logger = LazyLogger(__name__)
from decorator import decorator
from . import warnings
from .error import Res, error_to_json, extract_error_datetime
from .logging import make_logger
from .types import Json, asdict
logger = make_logger(__name__)
if TYPE_CHECKING:
# this is kinda pointless at the moment, but handy to annotate DF returning methods now
# later will be unignored when they implement type annotations
import pandas as pd # type: ignore
# DataFrameT = pd.DataFrame
# TODO ugh. pretty annoying, having any is not very useful since it would allow arbitrary coercions..
# ideally want to use a type that's like Any but doesn't allow arbitrary coercions??
DataFrameT = Any
import pandas as pd
DataFrameT = pd.DataFrame
SeriesT = pd.Series
from pandas._typing import S1 # meh
FuncT = TypeVar('FuncT', bound=Callable[..., DataFrameT])
# huh interesting -- with from __future__ import annotations don't even need else clause here?
# but still if other modules import these we do need some fake runtime types here..
else:
# in runtime, make it defensive so it works without pandas
from typing import Optional
DataFrameT = Any
SeriesT = Optional # just some type with one argument
S1 = Any
def check_dateish(s) -> Iterable[str]:
import pandas as pd # type: ignore # noqa: F811 not actually a redefinition
def _check_dateish(s: SeriesT[S1]) -> Iterable[str]:
import pandas as pd # noqa: F811 not actually a redefinition
ctype = s.dtype
if str(ctype).startswith('datetime64'):
return
@ -36,7 +58,7 @@ def check_dateish(s) -> Iterable[str]:
all_timestamps = s.apply(lambda x: isinstance(x, (pd.Timestamp, datetime))).all()
if not all_timestamps:
return # not sure why it would happen, but ok
tzs = s.map(lambda x: x.tzinfo).drop_duplicates()
tzs = s.map(lambda x: x.tzinfo).drop_duplicates() # type: ignore[union-attr, var-annotated, arg-type, return-value, unused-ignore]
examples = s[tzs.index]
# todo not so sure this warning is that useful... except for stuff without tz
yield f'''
@ -45,13 +67,50 @@ def check_dateish(s) -> Iterable[str]:
'''.strip()
from .compat import Literal
def test_check_dateish() -> None:
import pandas as pd
from .compat import fromisoformat
# empty series shouldn't warn
assert list(_check_dateish(pd.Series([]))) == []
# if no datetimes, shouldn't return any warnings
assert list(_check_dateish(pd.Series([1, 2, 3]))) == []
# all values are datetimes, shouldn't warn
# fmt: off
assert list(_check_dateish(pd.Series([
fromisoformat('2024-08-19T01:02:03'),
fromisoformat('2024-08-19T03:04:05'),
]))) == []
# fmt: on
# mixture of timezones -- should warn
# fmt: off
assert len(list(_check_dateish(pd.Series([
fromisoformat('2024-08-19T01:02:03'),
fromisoformat('2024-08-19T03:04:05Z'),
])))) == 1
# fmt: on
# TODO hmm. maybe this should actually warn?
# fmt: off
assert len(list(_check_dateish(pd.Series([
'whatever',
fromisoformat('2024-08-19T01:02:03'),
])))) == 0
# fmt: on
# fmt: off
ErrorColPolicy = Literal[
'add_if_missing', # add error column if it's missing
'warn' , # warn, but do not modify
'ignore' , # no warnings
]
# fmt: on
def check_error_column(df: DataFrameT, *, policy: ErrorColPolicy) -> Iterable[str]:
if 'error' in df:
@ -71,19 +130,15 @@ No 'error' column detected. You probably forgot to handle errors defensively, wh
yield wmsg
from typing import Any, Callable, TypeVar
FuncT = TypeVar('FuncT', bound=Callable[..., DataFrameT])
# TODO ugh. typing this is a mess... shoul I use mypy_extensions.VarArg/KwArgs?? or what??
from decorator import decorator
# TODO ugh. typing this is a mess... perhaps should use .compat.ParamSpec?
@decorator
def check_dataframe(f: FuncT, error_col_policy: ErrorColPolicy = 'add_if_missing', *args, **kwargs) -> DataFrameT:
df = f(*args, **kwargs)
df: DataFrameT = f(*args, **kwargs)
tag = '{f.__module__}:{f.__name__}'
# makes sense to keep super defensive
try:
for col, data in df.reset_index().iteritems():
for w in check_dateish(data):
for col, data in df.reset_index().items():
for w in _check_dateish(data):
warnings.low(f"{tag}, column '{col}': {w}")
except Exception as e:
logger.exception(e)
@ -94,11 +149,11 @@ def check_dataframe(f: FuncT, error_col_policy: ErrorColPolicy='add_if_missing',
logger.exception(e)
return df
# todo doctor: could have a suggestion to wrap dataframes with it?? discover by return type?
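As a usage sketch (not part of the diff; the module path my.core.pandas and the toy data are assumptions): check_dataframe is meant to wrap any DataFrame-returning function so the datetime and error-column checks above run on its output; under the default 'add_if_missing' policy a missing 'error' column is warned about and appended.

import pandas as pd

from my.core.pandas import DataFrameT, check_dataframe

@check_dataframe
def dataframe() -> DataFrameT:
    # deliberately no 'error' column -- the decorator should warn and add one
    return pd.DataFrame({'value': [1, 2, 3]})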
def error_to_row(e: Exception, *, dt_col: str='dt', tz=None) -> Json:
from .error import error_to_json, extract_error_datetime
def error_to_row(e: Exception, *, dt_col: str = 'dt', tz: timezone | None = None) -> Json:
edt = extract_error_datetime(e)
if edt is not None and edt.tzinfo is None and tz is not None:
edt = edt.replace(tzinfo=tz)
@ -107,8 +162,7 @@ def error_to_row(e: Exception, *, dt_col: str='dt', tz=None) -> Json:
return err_dict
# todo not sure about naming
def to_jsons(it: Iterable[Res[Any]]) -> Iterable[Json]:
def _to_jsons(it: Iterable[Res[Any]]) -> Iterable[Json]:
for r in it:
if isinstance(r, Exception):
yield error_to_row(r)
@ -120,11 +174,11 @@ def to_jsons(it: Iterable[Res[Any]]) -> Iterable[Json]:
# no type for dataclass?
Schema = Any
def _as_columns(s: Schema) -> Dict[str, Type]:
def _as_columns(s: Schema) -> dict[str, type]:
# todo would be nice to extract properties; add tests for this as well
import dataclasses as D
if D.is_dataclass(s):
return {f.name: f.type for f in D.fields(s)}
if dataclasses.is_dataclass(s):
return {f.name: f.type for f in dataclasses.fields(s)} # type: ignore[misc] # ugh, why mypy thinks f.type can return str??
# else must be NamedTuple??
# todo assert my.core.common.is_namedtuple?
return getattr(s, '_field_types')
@ -132,7 +186,7 @@ def _as_columns(s: Schema) -> Dict[str, Type]:
# todo add proper types
@check_dataframe
def as_dataframe(it: Iterable[Res[Any]], schema: Optional[Schema]=None) -> DataFrameT:
def as_dataframe(it: Iterable[Res[Any]], schema: Schema | None = None) -> DataFrameT:
# todo warn if schema isn't specified?
# ok nice supports dataframe/NT natively
# https://github.com/pandas-dev/pandas/pull/27999
@ -141,26 +195,88 @@ def as_dataframe(it: Iterable[Res[Any]], schema: Optional[Schema]=None) -> DataF
# same for NamedTuple -- seems that it takes whatever schema the first NT has
# so we need to convert each individually... sigh
import pandas as pd # noqa: F811 not actually a redefinition
columns = None if schema is None else list(_as_columns(schema).keys())
return pd.DataFrame(to_jsons(it), columns=columns)
return pd.DataFrame(_to_jsons(it), columns=columns)
# ugh. in principle this could be inside the test
# might be due to use of from __future__ import annotations
# can quickly reproduce by running pytest tests/tz.py tests/core/test_pandas.py
# possibly will be resolved after fix in pytest?
# see https://github.com/pytest-dev/pytest/issues/7856
@dataclasses.dataclass
class _X:
# FIXME try moving inside?
x: int
def test_as_dataframe() -> None:
import numpy as np
import pandas as pd
import pytest
it = (dict(i=i, s=f'str{i}') for i in range(10))
from pandas.testing import assert_frame_equal
from .compat import fromisoformat
it = ({'i': i, 's': f'str{i}'} for i in range(5))
with pytest.warns(UserWarning, match=r"No 'error' column") as record_warnings: # noqa: F841
df = as_dataframe(it)
df: DataFrameT = as_dataframe(it)
# todo test other error col policies
assert list(df.columns) == ['i', 's', 'error']
assert len(as_dataframe([])) == 0
# fmt: off
assert_frame_equal(
df,
pd.DataFrame({
'i' : [0 , 1 , 2 , 3 , 4 ],
's' : ['str0', 'str1', 'str2', 'str3', 'str4'],
# NOTE: error column is always added
'error': [None , None , None , None , None ],
}),
)
# fmt: on
assert_frame_equal(as_dataframe([]), pd.DataFrame(columns=['error']))
from dataclasses import dataclass
df2: DataFrameT = as_dataframe([], schema=_X)
assert_frame_equal(
df2,
# FIXME hmm. x column type should be an int?? and error should be string (or object??)
pd.DataFrame(columns=['x', 'error']),
)
@dataclass
class X:
x: int
@dataclasses.dataclass
class S:
value: str
# makes sense to specify the schema so the downstream program doesn't fail in case of empty iterable
df = as_dataframe([], schema=X)
assert list(df.columns) == ['x', 'error']
def it2() -> Iterator[Res[S]]:
yield S(value='test')
yield RuntimeError('i failed')
df = as_dataframe(it2())
# fmt: off
assert_frame_equal(
df,
pd.DataFrame(data={
'value': ['test', np.nan ],
'error': [np.nan, 'RuntimeError: i failed\n'],
'dt' : [np.nan, np.nan ],
}).astype(dtype={'dt': 'float'}), # FIXME should be datetime64 as below
)
# fmt: on
def it3() -> Iterator[Res[S]]:
yield S(value='aba')
yield RuntimeError('whoops')
yield S(value='cde')
yield RuntimeError('exception with datetime', fromisoformat('2024-08-19T22:47:01Z'))
df = as_dataframe(it3())
# fmt: off
assert_frame_equal(df, pd.DataFrame(data={
'value': ['aba' , np.nan , 'cde' , np.nan ],
'error': [np.nan, 'RuntimeError: whoops\n', np.nan, "RuntimeError: ('exception with datetime', datetime.datetime(2024, 8, 19, 22, 47, 1, tzinfo=datetime.timezone.utc))\n"],
# note: dt column is added even if errors don't have an associated datetime
'dt' : [np.nan, np.nan , np.nan, '2024-08-19 22:47:01+00:00'],
}).astype(dtype={'dt': 'datetime64[ns, UTC]'}))
# fmt: on
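For orientation, a minimal sketch (the Item dataclass and its values are made up) of feeding a mixed stream of items and errors into as_dataframe, mirroring the tests above:

from dataclasses import dataclass

from my.core.pandas import as_dataframe

@dataclass
class Item:
    name: str

def items():
    yield Item(name='a')
    yield RuntimeError('failed to parse')  # becomes a row with the 'error'/'dt' columns populated
    yield Item(name='b')

df = as_dataframe(items())             # columns inferred from the data; 'error' is always added
empty = as_dataframe([], schema=Item)  # schema keeps the ['name', 'error'] columns even for empty input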


@ -1,8 +1,14 @@
from pathlib import Path
# todo preinit isn't really a good name? it's only in a separate file because
# - it's imported from my.core.init (so we want to keep this file as small/reliable as possible, hence not common or something)
# - we still need this function in __main__, so has to be separate from my/core/init.py
def get_mycfg_dir() -> Path:
import appdirs # type: ignore[import]
import os
import appdirs # type: ignore[import-untyped]
# not sure if that's necessary, i.e. could rely on PYTHONPATH instead
# on the other hand, by using MY_CONFIG we are guaranteed to load it from the desired path?
mvar = os.environ.get('MY_CONFIG')

my/core/pytest.py (new file, 24 lines)

@ -0,0 +1,24 @@
"""
Helpers to prevent depending on pytest in runtime
"""
from .internal import assert_subpackage
assert_subpackage(__name__)
import sys
import typing
under_pytest = 'pytest' in sys.modules
if typing.TYPE_CHECKING or under_pytest:
import pytest
parametrize = pytest.mark.parametrize
else:
def parametrize(*_args, **_kwargs):
def wrapper(f):
return f
return wrapper
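A quick sketch of the intended usage: importing parametrize from this shim instead of pytest keeps modules with inline tests importable even when pytest isn't installed or running.

from my.core.pytest import parametrize

@parametrize('n', [1, 2, 3])
def test_double(n: int) -> None:
    # under pytest this expands to three test cases; otherwise parametrize is a no-op wrapper
    assert n * 2 == n + n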


@ -5,20 +5,29 @@ The main entrypoint to this library is the 'select' function below; try:
python3 -c "from my.core.query import select; help(select)"
"""
from __future__ import annotations
import dataclasses
import importlib
import inspect
import itertools
from collections.abc import Iterable, Iterator
from datetime import datetime
from typing import TypeVar, Tuple, Optional, Union, Callable, Iterable, Iterator, Dict, Any, NamedTuple, List
from typing import (
Any,
Callable,
NamedTuple,
Optional,
TypeVar,
)
import more_itertools
from .common import is_namedtuple
from . import error as err
from .error import Res, unwrap
from .types import is_namedtuple
from .warnings import low
T = TypeVar("T")
ET = Res[T]
@ -26,7 +35,7 @@ ET = Res[T]
U = TypeVar("U")
# In a perfect world, the return value from a OrderFunc would just be U,
# not Optional[U]. However, since this has to deal with so many edge
# cases, theres a possibility that the functions generated by
# cases, there's a possibility that the functions generated by
# _generate_order_by_func can't find an attribute
OrderFunc = Callable[[ET], Optional[U]]
Where = Callable[[ET], bool]
@ -39,6 +48,7 @@ class Unsortable(NamedTuple):
class QueryException(ValueError):
"""Used to differentiate query-related errors, so the CLI interface is more expressive"""
pass
@ -51,16 +61,16 @@ def locate_function(module_name: str, function_name: str) -> Callable[[], Iterab
"""
try:
mod = importlib.import_module(module_name)
for (fname, func) in inspect.getmembers(mod, inspect.isfunction):
for fname, f in inspect.getmembers(mod, inspect.isfunction):
if fname == function_name:
return func
return f
# in case the function is defined dynamically,
# like with a globals().setdefault(...) or a module-level __getattr__ function
func = getattr(mod, function_name, None)
if func is not None and callable(func):
return func
except Exception as e:
raise QueryException(str(e))
raise QueryException(str(e)) # noqa: B904
raise QueryException(f"Could not find function '{function_name}' in '{module_name}'")
@ -74,7 +84,7 @@ def locate_qualified_function(qualified_name: str) -> Callable[[], Iterable[ET]]
return locate_function(qualified_name[:rdot_index], qualified_name[rdot_index + 1 :])
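For context, a hypothetical example (module and function names are illustrative) of resolving a dotted name into a callable, e.g. for the CLI's query interface:

from my.core.query import locate_qualified_function

fn = locate_qualified_function('my.reddit.comments')  # imports my.reddit, returns its comments function
items = list(fn())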
def attribute_func(obj: T, where: Where, default: Optional[U] = None) -> Optional[OrderFunc]:
def attribute_func(obj: T, where: Where, default: U | None = None) -> OrderFunc | None:
"""
Attempts to find an attribute which matches the 'where_function' on the object,
using some getattr/dict checks. Returns a function which when called with
@ -102,7 +112,7 @@ def attribute_func(obj: T, where: Where, default: Optional[U] = None) -> Optiona
if where(v):
return lambda o: o.get(k, default) # type: ignore[union-attr]
elif dataclasses.is_dataclass(obj):
for (field_name, _annotation) in obj.__annotations__.items():
for field_name in obj.__annotations__.keys():
if where(getattr(obj, field_name)):
return lambda o: getattr(o, field_name, default)
elif is_namedtuple(obj):
@ -120,11 +130,12 @@ def attribute_func(obj: T, where: Where, default: Optional[U] = None) -> Optiona
def _generate_order_by_func(
obj_res: Res[T],
key: Optional[str] = None,
where_function: Optional[Where] = None,
default: Optional[U] = None,
*,
key: str | None = None,
where_function: Where | None = None,
default: U | None = None,
force_unsortable: bool = False,
) -> Optional[OrderFunc]:
) -> OrderFunc | None:
"""
Accepts an object Res[T] (Instance of some class or Exception)
@ -177,7 +188,7 @@ pass 'drop_exceptions' to ignore exceptions""")
return lambda o: o.get(key, default) # type: ignore[union-attr]
else:
if hasattr(obj, key):
return lambda o: getattr(o, key, default) # type: ignore[arg-type]
return lambda o: getattr(o, key, default)
# Note: if the attribute you're ordering by is an Optional type,
# and on some objects it'll return None, the getattr(o, field_name, default) won't
@ -189,7 +200,7 @@ pass 'drop_exceptions' to ignore exceptions""")
# user must provide either a key or a where predicate
if where_function is not None:
func: Optional[OrderFunc] = attribute_func(obj, where_function, default)
func: OrderFunc | None = attribute_func(obj, where_function, default)
if func is not None:
return func
@ -205,29 +216,13 @@ pass 'drop_exceptions' to ignore exceptions""")
return None # couldn't compute a OrderFunc for this class/instance
def _drop_exceptions(itr: Iterator[ET]) -> Iterator[T]:
"""Return non-errors from the iterable"""
for o in itr:
if isinstance(o, Exception):
continue
yield o
def _raise_exceptions(itr: Iterable[ET]) -> Iterator[T]:
"""Raise errors from the iterable, stops the select function"""
for o in itr:
if isinstance(o, Exception):
raise o
yield o
# currently using the 'key set' as a proxy for 'this is the same type of thing'
def _determine_order_by_value_key(obj_res: ET) -> Any:
"""
Returns either the class, or a tuple of the dictionary keys
"""
key = obj_res.__class__
if key == dict:
if key is dict:
# assuming same keys signify same way to determine ordering
return tuple(obj_res.keys()) # type: ignore[union-attr]
return key
@ -244,8 +239,8 @@ def _drop_unsorted(itr: Iterator[ET], orderfunc: OrderFunc) -> Iterator[ET]:
# try getting the first value from the iterator
# similar to my.core.common.warn_if_empty? this doesnt go through the whole iterator though
def _peek_iter(itr: Iterator[ET]) -> Tuple[Optional[ET], Iterator[ET]]:
# similar to my.core.common.warn_if_empty? this doesn't go through the whole iterator though
def _peek_iter(itr: Iterator[ET]) -> tuple[ET | None, Iterator[ET]]:
itr = more_itertools.peekable(itr)
try:
first_item = itr.peek()
@ -256,9 +251,9 @@ def _peek_iter(itr: Iterator[ET]) -> Tuple[Optional[ET], Iterator[ET]]:
# similar to 'my.core.error.sort_res_by'?
def _wrap_unsorted(itr: Iterator[ET], orderfunc: OrderFunc) -> Tuple[Iterator[Unsortable], Iterator[ET]]:
unsortable: List[Unsortable] = []
sortable: List[ET] = []
def _wrap_unsorted(itr: Iterator[ET], orderfunc: OrderFunc) -> tuple[Iterator[Unsortable], Iterator[ET]]:
unsortable: list[Unsortable] = []
sortable: list[ET] = []
for o in itr:
# if input to select was another select
if isinstance(o, Unsortable):
@ -276,10 +271,11 @@ def _wrap_unsorted(itr: Iterator[ET], orderfunc: OrderFunc) -> Tuple[Iterator[Un
# the second being items for which orderfunc returned a non-none value
def _handle_unsorted(
itr: Iterator[ET],
*,
orderfunc: OrderFunc,
drop_unsorted: bool,
wrap_unsorted: bool
) -> Tuple[Iterator[Unsortable], Iterator[ET]]:
) -> tuple[Iterator[Unsortable], Iterator[ET]]:
# prefer drop_unsorted to wrap_unsorted, if both were present
if drop_unsorted:
return iter([]), _drop_unsorted(itr, orderfunc)
@ -290,20 +286,20 @@ def _handle_unsorted(
return iter([]), itr
# handles creating an order_value functon, using a lookup for
# handles creating an order_value function, using a lookup for
# different types. ***This consumes the iterator***, so
# you should definitely itertools.tee it beforehand
# as to not exhaust the values
def _generate_order_value_func(itr: Iterator[ET], order_value: Where, default: Optional[U] = None) -> OrderFunc:
def _generate_order_value_func(itr: Iterator[ET], order_value: Where, default: U | None = None) -> OrderFunc:
# TODO: add a kwarg to force lookup for every item? would sort of be like core.common.guess_datetime then
order_by_lookup: Dict[Any, OrderFunc] = {}
order_by_lookup: dict[Any, OrderFunc] = {}
# need to go through a copy of the whole iterator here to
# pre-generate functions to support sorting mixed types
for obj_res in itr:
key: Any = _determine_order_by_value_key(obj_res)
if key not in order_by_lookup:
keyfunc: Optional[OrderFunc] = _generate_order_by_func(
keyfunc: OrderFunc | None = _generate_order_by_func(
obj_res,
where_function=order_value,
default=default,
@ -324,12 +320,12 @@ def _generate_order_value_func(itr: Iterator[ET], order_value: Where, default: O
def _handle_generate_order_by(
itr,
*,
order_by: Optional[OrderFunc] = None,
order_key: Optional[str] = None,
order_value: Optional[Where] = None,
default: Optional[U] = None,
) -> Tuple[Optional[OrderFunc], Iterator[ET]]:
order_by_chosen: Optional[OrderFunc] = order_by # if the user just supplied a function themselves
order_by: OrderFunc | None = None,
order_key: str | None = None,
order_value: Where | None = None,
default: U | None = None,
) -> tuple[OrderFunc | None, Iterator[ET]]:
order_by_chosen: OrderFunc | None = order_by # if the user just supplied a function themselves
if order_by is not None:
return order_by, itr
if order_key is not None:
@ -354,17 +350,19 @@ def _handle_generate_order_by(
def select(
src: Union[Iterable[ET], Callable[[], Iterable[ET]]],
src: Iterable[ET] | Callable[[], Iterable[ET]],
*,
where: Optional[Where] = None,
order_by: Optional[OrderFunc] = None,
order_key: Optional[str] = None,
order_value: Optional[Where] = None,
default: Optional[U] = None,
where: Where | None = None,
order_by: OrderFunc | None = None,
order_key: str | None = None,
order_value: Where | None = None,
default: U | None = None,
reverse: bool = False,
limit: Optional[int] = None,
limit: int | None = None,
drop_unsorted: bool = False,
wrap_unsorted: bool = True,
warn_exceptions: bool = False,
warn_func: Callable[[Exception], None] | None = None,
drop_exceptions: bool = False,
raise_exceptions: bool = False,
) -> Iterator[ET]:
@ -374,7 +372,7 @@ def select(
by allowing you to provide custom predicates (functions) which can sort
by a function, an attribute, dict key, or by the attributes values.
Since this supports mixed types, theres always a possibility
Since this supports mixed types, there's always a possibility
of KeyErrors or AttributeErrors while trying to find some value to order by,
so this provides multiple mechanisms to deal with that
@ -408,7 +406,9 @@ def select(
to copy the iterator in memory (using itertools.tee) to determine how to order it
in memory
The 'drop_exceptions' and 'raise_exceptions' let you ignore or raise when the src contains exceptions
The 'drop_exceptions', 'raise_exceptions', 'warn_exceptions' let you ignore or raise
when the src contains exceptions. The 'warn_func' lets you provide a custom function
to call when an exception is encountered instead of using the 'warnings' module
src: an iterable of mixed types, or a function to be called,
as the input to this function
@ -464,15 +464,18 @@ Will attempt to call iter() on the value""")
try:
itr: Iterator[ET] = iter(it)
except TypeError as t:
raise QueryException("Could not convert input src to an Iterator: " + str(t))
raise QueryException("Could not convert input src to an Iterator: " + str(t)) # noqa: B904
# if both drop_exceptions and drop_exceptions are provided for some reason,
# should raise exceptions before dropping them
if raise_exceptions:
itr = _raise_exceptions(itr)
itr = err.raise_exceptions(itr)
if drop_exceptions:
itr = _drop_exceptions(itr)
itr = err.drop_exceptions(itr)
if warn_exceptions:
itr = err.warn_exceptions(itr, warn_func=warn_func)
if where is not None:
itr = filter(where, itr)
@ -498,10 +501,15 @@ Will attempt to call iter() on the value""")
# note: can't just attach sort unsortable values in the same iterable as the
# other items because they don't have any lookups for order_key or functions
# to handle items in the order_by_lookup dictionary
unsortable, itr = _handle_unsorted(itr, order_by_chosen, drop_unsorted, wrap_unsorted)
unsortable, itr = _handle_unsorted(
itr,
orderfunc=order_by_chosen,
drop_unsorted=drop_unsorted,
wrap_unsorted=wrap_unsorted,
)
# run the sort, with the computed order by function
itr = iter(sorted(itr, key=order_by_chosen, reverse=reverse)) # type: ignore[arg-type, type-var]
itr = iter(sorted(itr, key=order_by_chosen, reverse=reverse)) # type: ignore[arg-type]
# re-attach unsortable values to the front/back of the list
if reverse:
@ -589,7 +597,7 @@ def test_couldnt_determine_order() -> None:
res = list(select(iter([object()]), order_value=lambda o: isinstance(o, datetime)))
assert len(res) == 1
assert isinstance(res[0], Unsortable)
assert type(res[0].obj) == object
assert type(res[0].obj) is object
# same value type, different keys, with clashing keys
@ -605,7 +613,7 @@ class _B(NamedTuple):
# move these to tests/? They are re-used so much in the tests below,
# not sure where the best place for these is
def _mixed_iter() -> Iterator[Union[_A, _B]]:
def _mixed_iter() -> Iterator[_A | _B]:
yield _A(x=datetime(year=2009, month=5, day=10, hour=4, minute=10, second=1), y=5, z=10)
yield _B(y=datetime(year=2015, month=5, day=10, hour=4, minute=10, second=1))
yield _A(x=datetime(year=2005, month=5, day=10, hour=4, minute=10, second=1), y=10, z=2)
@ -614,7 +622,7 @@ def _mixed_iter() -> Iterator[Union[_A, _B]]:
yield _A(x=datetime(year=2005, month=4, day=10, hour=4, minute=10, second=1), y=2, z=-5)
def _mixed_iter_errors() -> Iterator[Res[Union[_A, _B]]]:
def _mixed_iter_errors() -> Iterator[Res[_A | _B]]:
m = _mixed_iter()
yield from itertools.islice(m, 0, 3)
yield RuntimeError("Unhandled error!")
@ -650,7 +658,7 @@ def test_wrap_unsortable() -> None:
# by default, wrap unsortable
res = list(select(_mixed_iter(), order_key="z"))
assert Counter(map(lambda t: type(t).__name__, res)) == Counter({"_A": 4, "Unsortable": 2})
assert Counter(type(t).__name__ for t in res) == Counter({"_A": 4, "Unsortable": 2})
def test_disabled_wrap_unsorted() -> None:
@ -669,7 +677,7 @@ def test_drop_unsorted() -> None:
# test drop unsortable, should remove them before the 'sorted' call
res = list(select(_mixed_iter(), order_key="z", wrap_unsorted=False, drop_unsorted=True))
assert len(res) == 4
assert Counter(map(lambda t: type(t).__name__, res)) == Counter({"_A": 4})
assert Counter(type(t).__name__ for t in res) == Counter({"_A": 4})
def test_drop_exceptions() -> None:
@ -693,15 +701,16 @@ def test_raise_exceptions() -> None:
def test_wrap_unsortable_with_error_and_warning() -> None:
import pytest
from collections import Counter
import pytest
# by default should wrap unsortable (error)
with pytest.warns(UserWarning, match=r"encountered exception"):
res = list(select(_mixed_iter_errors(), order_value=lambda o: isinstance(o, datetime)))
assert Counter(map(lambda t: type(t).__name__, res)) == Counter({"_A": 4, "_B": 2, "Unsortable": 1})
assert Counter(type(t).__name__ for t in res) == Counter({"_A": 4, "_B": 2, "Unsortable": 1})
# compare the returned error wrapped in the Unsortable
returned_error = next((o for o in res if isinstance(o, Unsortable))).obj
returned_error = next(o for o in res if isinstance(o, Unsortable)).obj
assert "Unhandled error!" == str(returned_error)
@ -711,7 +720,7 @@ def test_order_key_unsortable() -> None:
# both unsortable and items which don't match the order_by (order_key) in this case should be classified unsorted
res = list(select(_mixed_iter_errors(), order_key="z"))
assert Counter(map(lambda t: type(t).__name__, res)) == Counter({"_A": 4, "Unsortable": 3})
assert Counter(type(t).__name__ for t in res) == Counter({"_A": 4, "Unsortable": 3})
def test_order_default_param() -> None:
@ -731,7 +740,7 @@ def test_no_recursive_unsortables() -> None:
# select to select as input, wrapping unsortables the first time, second should drop them
# reverse=True to send errors to the end, so the below order_key works
res = list(select(_mixed_iter_errors(), order_key="z", reverse=True))
assert Counter(map(lambda t: type(t).__name__, res)) == Counter({"_A": 4, "Unsortable": 3})
assert Counter(type(t).__name__ for t in res) == Counter({"_A": 4, "Unsortable": 3})
# drop_unsorted
dropped = list(select(res, order_key="z", drop_unsorted=True))
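To illustrate the reworked exception handling (the data here is made up), a small sketch of select with the new warn_exceptions flag:

from my.core.query import select

data = [
    {'id': 3, 'value': 'c'},
    RuntimeError('failed to fetch item'),
    {'id': 1, 'value': 'a'},
]

# silently skip the error and order the dicts by their 'id' key
clean = list(select(data, order_key='id', drop_exceptions=True))

# or keep going, reporting the error via the warnings module (warn_func swaps in a custom handler);
# since the error can't be ordered by 'id', it comes back wrapped in Unsortable, as in the tests above
noisy = list(select(data, order_key='id', warn_exceptions=True))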


@ -7,27 +7,30 @@ filtered iterator
See the select_range function below
"""
from __future__ import annotations
import re
import time
from functools import lru_cache
from datetime import datetime, timedelta, date
from typing import Callable, Iterator, NamedTuple, Optional, Any, Type
from collections.abc import Iterator
from datetime import date, datetime, timedelta
from functools import cache
from typing import Any, Callable, NamedTuple
import more_itertools
from .compat import fromisoformat
from .query import (
QueryException,
select,
ET,
OrderFunc,
QueryException,
Where,
_handle_generate_order_by,
ET,
select,
)
from .common import isoparse
timedelta_regex = re.compile(r"^((?P<weeks>[\.\d]+?)w)?((?P<days>[\.\d]+?)d)?((?P<hours>[\.\d]+?)h)?((?P<minutes>[\.\d]+?)m)?((?P<seconds>[\.\d]+?)s)?$")
timedelta_regex = re.compile(
r"^((?P<weeks>[\.\d]+?)w)?((?P<days>[\.\d]+?)d)?((?P<hours>[\.\d]+?)h)?((?P<minutes>[\.\d]+?)m)?((?P<seconds>[\.\d]+?)s)?$"
)
# https://stackoverflow.com/a/51916936
@ -40,7 +43,7 @@ def parse_timedelta_string(timedelta_str: str) -> timedelta:
if parts is None:
raise ValueError(f"Could not parse time duration from {timedelta_str}.\nValid examples: '8h', '1w2d8h5m20s', '2m4s'")
time_params = {name: float(param) for name, param in parts.groupdict().items() if param}
return timedelta(**time_params) # type: ignore[arg-type]
return timedelta(**time_params)
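For reference, what the regex above accepts (examples taken from the error message; parse_timedelta_float, defined just below, converts the same strings to seconds):

from my.core.query_range import parse_timedelta_float, parse_timedelta_string

parse_timedelta_string('8h')           # timedelta(hours=8)
parse_timedelta_string('1w2d8h5m20s')  # timedelta(weeks=1, days=2, hours=8, minutes=5, seconds=20)
parse_timedelta_float('2m4s')          # 124.0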
def parse_timedelta_float(timedelta_str: str) -> float:
@ -73,19 +76,34 @@ def parse_datetime_float(date_str: str) -> float:
return ds_float
try:
# isoformat - default format when you call str() on datetime
# this also parses dates like '2020-01-01'
return datetime.fromisoformat(ds).timestamp()
except ValueError:
pass
try:
return isoparse(ds).timestamp()
return fromisoformat(ds).timestamp()
except (AssertionError, ValueError):
pass
try:
import dateparser
except ImportError:
pass
else:
# dateparser is a bit more lenient than the above, lets you type
# all sorts of dates as inputs
# https://github.com/scrapinghub/dateparser#how-to-use
res: datetime | None = dateparser.parse(ds, settings={"DATE_ORDER": "YMD"})
if res is not None:
return res.timestamp()
raise QueryException(f"Was not able to parse {ds} into a datetime")
# probably DateLike input? but a user could specify an order_key
# which is an epoch timestamp or a float value which they
# expect to be converted to a datetime to compare
@lru_cache(maxsize=None)
@cache
def _datelike_to_float(dl: Any) -> float:
if isinstance(dl, datetime):
return dl.timestamp()
@ -96,7 +114,7 @@ def _datelike_to_float(dl: Any) -> float:
try:
return parse_datetime_float(dl)
except QueryException as q:
raise QueryException(f"While attempting to extract datetime from {dl}, to order by datetime:\n\n" + str(q))
raise QueryException(f"While attempting to extract datetime from {dl}, to order by datetime:\n\n" + str(q)) # noqa: B904
class RangeTuple(NamedTuple):
@ -117,11 +135,12 @@ class RangeTuple(NamedTuple):
of the timeframe -- 'before'
- before and after - anything after 'after' and before 'before', acts as a time range
"""
# technically doesn't need to be Optional[Any],
# just to make it more clear these can be None
after: Optional[Any]
before: Optional[Any]
within: Optional[Any]
after: Any | None
before: Any | None
within: Any | None
Converter = Callable[[Any], Any]
@ -132,14 +151,15 @@ def _parse_range(
unparsed_range: RangeTuple,
end_parser: Converter,
within_parser: Converter,
parsed_range: Optional[RangeTuple] = None,
error_message: Optional[str] = None
) -> Optional[RangeTuple]:
parsed_range: RangeTuple | None = None,
error_message: str | None = None,
) -> RangeTuple | None:
if parsed_range is not None:
return parsed_range
err_msg = error_message or RangeTuple.__doc__
assert err_msg is not None # make mypy happy
after, before, within = None, None, None
none_count = more_itertools.ilen(filter(lambda o: o is None, list(unparsed_range)))
@ -162,11 +182,11 @@ def _create_range_filter(
end_parser: Converter,
within_parser: Converter,
attr_func: Where,
parsed_range: Optional[RangeTuple] = None,
default_before: Optional[Any] = None,
value_coercion_func: Optional[Converter] = None,
error_message: Optional[str] = None,
) -> Optional[Where]:
parsed_range: RangeTuple | None = None,
default_before: Any | None = None,
value_coercion_func: Converter | None = None,
error_message: str | None = None,
) -> Where | None:
"""
Handles:
- parsing the user input into values that are comparable to items the iterable returns
@ -220,7 +240,7 @@ def _create_range_filter(
# inclusivity here? Is [after, before) currently,
# items are included on the lower bound but not the
# upper bound
# typically used for datetimes so doesnt have to
# typically used for datetimes so doesn't have to
# be exact in that case
def generated_predicate(obj: Any) -> bool:
ov: Any = attr_func(obj)
@ -258,15 +278,17 @@ def _create_range_filter(
def select_range(
itr: Iterator[ET],
*,
where: Optional[Where] = None,
order_key: Optional[str] = None,
order_value: Optional[Where] = None,
order_by_value_type: Optional[Type] = None,
unparsed_range: Optional[RangeTuple] = None,
where: Where | None = None,
order_key: str | None = None,
order_value: Where | None = None,
order_by_value_type: type | None = None,
unparsed_range: RangeTuple | None = None,
reverse: bool = False,
limit: Optional[int] = None,
limit: int | None = None,
drop_unsorted: bool = False,
wrap_unsorted: bool = False,
warn_exceptions: bool = False,
warn_func: Callable[[Exception], None] | None = None,
drop_exceptions: bool = False,
raise_exceptions: bool = False,
) -> Iterator[ET]:
@ -293,21 +315,30 @@ def select_range(
unparsed_range = None
# some operations to do before ordering/filtering
if drop_exceptions or raise_exceptions or where is not None:
# doesnt wrap unsortable items, because we pass no order related kwargs
itr = select(itr, where=where, drop_exceptions=drop_exceptions, raise_exceptions=raise_exceptions)
if drop_exceptions or raise_exceptions or where is not None or warn_exceptions:
# doesn't wrap unsortable items, because we pass no order related kwargs
itr = select(
itr,
where=where,
drop_exceptions=drop_exceptions,
raise_exceptions=raise_exceptions,
warn_exceptions=warn_exceptions,
warn_func=warn_func,
)
order_by_chosen: Optional[OrderFunc] = None
order_by_chosen: OrderFunc | None = None
# if the user didn't specify an attribute to order value, but specified a type
# we should search for on each value in the iterator
if order_value is None and order_by_value_type is not None:
# search for that type on the iterator object
order_value = lambda o: isinstance(o, order_by_value_type) # type: ignore
order_value = lambda o: isinstance(o, order_by_value_type)
# if the user supplied a order_key, and/or we've generated an order_value, create
# the function that accesses that type on each value in the iterator
if order_key is not None or order_value is not None:
# _generate_order_value_func internally here creates a copy of the iterator, which has to
# be consumed in-case we're sorting by mixed types
order_by_chosen, itr = _handle_generate_order_by(itr, order_key=order_key, order_value=order_value)
# signifies that itr is empty -- can early return here
if order_by_chosen is None:
@ -319,11 +350,11 @@ def select_range(
if order_by_chosen is None:
raise QueryException("""Can't order by range if we have no way to order_by!
Specify a type or a key to order the value by""")
else:
# force drop_unsorted=True so we can use _create_range_filter
# sort the iterable by the generated order_by_chosen function
itr = select(itr, order_by=order_by_chosen, drop_unsorted=True)
filter_func: Optional[Where]
filter_func: Where | None
if order_by_value_type in [datetime, date]:
filter_func = _create_range_filter(
unparsed_range=unparsed_range,
@ -331,7 +362,8 @@ Specify a type or a key to order the value by""")
within_parser=parse_timedelta_float,
attr_func=order_by_chosen, # type: ignore[arg-type]
default_before=time.time(),
value_coercion_func=_datelike_to_float)
value_coercion_func=_datelike_to_float,
)
elif order_by_value_type in [int, float]:
# allow primitives to be converted using the default int(), float() callables
filter_func = _create_range_filter(
@ -340,7 +372,8 @@ Specify a type or a key to order the value by""")
within_parser=order_by_value_type,
attr_func=order_by_chosen, # type: ignore[arg-type]
default_before=None,
value_coercion_func=order_by_value_type)
value_coercion_func=order_by_value_type,
)
else:
# TODO: add additional kwargs to let the user sort by other values, by specifying the parsers?
# would need to allow passing the end_parser, within parser, default before and value_coercion_func...
@ -356,7 +389,7 @@ Specify a type or a key to order the value by""")
#
# this select is also run if the user didn't specify anything to
# order by, and is just returning the data in the same order as
# as the srouce iterable
# as the source iterable
# i.e. none of the range-related filtering code ran, this is just a select
itr = select(itr,
order_by=order_by_chosen,
@ -367,7 +400,7 @@ Specify a type or a key to order the value by""")
return itr
# re-use items from query for testing
# reuse items from query for testing
from .query import _A, _B, _Float, _mixed_iter_errors
@ -447,8 +480,8 @@ def test_range_predicate() -> None:
)
# filter from 0 to 5
rn: Optional[RangeTuple] = RangeTuple("0", "5", None)
zero_to_five_filter: Optional[Where] = int_filter_func(unparsed_range=rn)
rn: RangeTuple = RangeTuple("0", "5", None)
zero_to_five_filter: Where | None = int_filter_func(unparsed_range=rn)
assert zero_to_five_filter is not None
# this is just a Where function, given some input it return True/False if the value is allowed
assert zero_to_five_filter(3) is True
@ -461,6 +494,7 @@ def test_range_predicate() -> None:
rn = RangeTuple(None, 3, "3.5")
assert list(filter(int_filter_func(unparsed_range=rn, attr_func=identity), src())) == ["0", "1", "2"]
def test_parse_range() -> None:
from functools import partial
@ -483,7 +517,7 @@ def test_parse_range() -> None:
assert res2 == RangeTuple(after=start_date.timestamp(), before=end_date.timestamp(), within=None)
# cant specify all three
# can't specify all three
with pytest.raises(QueryException, match=r"Cannot specify 'after', 'before' and 'within'"):
dt_parse_range(unparsed_range=RangeTuple(str(start_date), str(end_date.timestamp()), "7d"))
@ -504,9 +538,8 @@ def test_parse_timedelta_string() -> None:
def test_parse_datetime_float() -> None:
pnow = parse_datetime_float("now")
sec_diff = abs((pnow - datetime.now().timestamp()))
sec_diff = abs(pnow - datetime.now().timestamp())
# should probably never fail? could mock time.time
# but there seem to be issues with doing that for C-libraries (as time.time does)
# https://docs.python.org/3/library/unittest.mock-examples.html#partial-mocking
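Tying the pieces together, a rough sketch (the event stream and its 'dt' field are made up) of using RangeTuple's 'within' to keep only recent items:

from datetime import datetime, timedelta

from my.core.query_range import RangeTuple, select_range

def events():
    now = datetime.now()
    for days_ago in range(30):
        yield {'dt': now - timedelta(days=days_ago), 'n': days_ago}

# order by the datetime found on each item and keep the last week's worth, newest first
recent = list(
    select_range(
        events(),
        order_by_value_type=datetime,
        unparsed_range=RangeTuple(after=None, before=None, within='7d'),
        reverse=True,
    )
)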


@ -1,12 +1,15 @@
import datetime
import dataclasses
from pathlib import Path
from decimal import Decimal
from typing import Any, Optional, Callable, NamedTuple
from functools import lru_cache
from __future__ import annotations
import datetime
from dataclasses import asdict, is_dataclass
from decimal import Decimal
from functools import cache
from pathlib import Path
from typing import Any, Callable, NamedTuple
from .common import is_namedtuple
from .error import error_to_json
from .pytest import parametrize
from .types import is_namedtuple
# note: it would be nice to combine the 'asdict' and _default_encode to some function
# that takes a complex python object and returns JSON-compatible fields, while still
@ -16,6 +19,8 @@ from .error import error_to_json
DefaultEncoder = Callable[[Any], Any]
Dumps = Callable[[Any], str]
def _default_encode(obj: Any) -> Any:
"""
@ -33,8 +38,9 @@ def _default_encode(obj: Any) -> Any:
# convert paths to their string representation
if isinstance(obj, Path):
return str(obj)
if dataclasses.is_dataclass(obj):
return dataclasses.asdict(obj)
if is_dataclass(obj):
assert not isinstance(obj, type) # to help mypy
return asdict(obj)
if isinstance(obj, Exception):
return error_to_json(obj)
# if something was stored as 'decimal', you likely
@ -53,19 +59,18 @@ def _default_encode(obj: Any) -> Any:
# could possibly run multiple times/raise warning if you provide different 'default'
# functions or change the kwargs? The alternative is to maintain all of this at the module
# level, which is just as annoying
@lru_cache(maxsize=None)
@cache
def _dumps_factory(**kwargs) -> Callable[[Any], str]:
use_default: DefaultEncoder = _default_encode
# if the user passed an additional 'default' parameter,
# try using that to serialize before before _default_encode
_additional_default: Optional[DefaultEncoder] = kwargs.get("default")
_additional_default: DefaultEncoder | None = kwargs.get("default")
if _additional_default is not None and callable(_additional_default):
def wrapped_default(obj: Any) -> Any:
assert _additional_default is not None
try:
# hmm... shouldn't mypy know that _additional_default is not None here?
# assert _additional_default is not None
return _additional_default(obj) # type: ignore[misc]
return _additional_default(obj)
except TypeError:
# expected TypeError, signifies couldn't be encoded by custom
# serializer function. Try _default_encode from here
@ -75,28 +80,35 @@ def _dumps_factory(**kwargs) -> Callable[[Any], str]:
kwargs["default"] = use_default
prefer_factory: str | None = kwargs.pop('_prefer_factory', None)
def orjson_factory() -> Dumps | None:
try:
import orjson
except ModuleNotFoundError:
return None
# todo: add orjson.OPT_NON_STR_KEYS? would require some bitwise ops
# most keys are typically attributes from a NT/Dataclass,
# so most seem to work: https://github.com/ijl/orjson#opt_non_str_keys
def _orjson_dumps(obj: Any) -> str:
def _orjson_dumps(obj: Any) -> str: # TODO rename?
# orjson returns json as bytes, encode to string
return orjson.dumps(obj, **kwargs).decode('utf-8')
return _orjson_dumps
except ModuleNotFoundError:
pass
def simplejson_factory() -> Dumps | None:
try:
from simplejson import dumps as simplejson_dumps
except ModuleNotFoundError:
return None
# if orjson couldn't be imported, try simplejson
# This is included for compatibility reasons because orjson
# is rust-based and compiling on rarer architectures may not work
# out of the box
#
# unlike the builtin JSON modue which serializes NamedTuples as lists
# unlike the builtin JSON module which serializes NamedTuples as lists
# (even if you provide a default function), simplejson correctly
# serializes namedtuples to dictionaries
@ -105,23 +117,42 @@ def _dumps_factory(**kwargs) -> Callable[[Any], str]:
return _simplejson_dumps
except ModuleNotFoundError:
pass
def stdlib_factory() -> Dumps | None:
import json
from .warnings import high
high("You might want to install 'orjson' to support serialization for lots more types! If that does not work for you, you can install 'simplejson' instead")
high(
"You might want to install 'orjson' to support serialization for lots more types! If that does not work for you, you can install 'simplejson' instead"
)
def _stdlib_dumps(obj: Any) -> str:
return json.dumps(obj, **kwargs)
return _stdlib_dumps
factories = {
'orjson': orjson_factory,
'simplejson': simplejson_factory,
'stdlib': stdlib_factory,
}
if prefer_factory is not None:
factory = factories[prefer_factory]
res = factory()
assert res is not None, prefer_factory
return res
for factory in factories.values():
res = factory()
if res is not None:
return res
raise RuntimeError("Should not happen!")
def dumps(
obj: Any,
default: Optional[DefaultEncoder] = None,
default: DefaultEncoder | None = None,
**kwargs,
) -> str:
"""
@ -154,10 +185,19 @@ def dumps(
return _dumps_factory(default=default, **kwargs)(obj)
def test_serialize_fallback() -> None:
import json as jsn # dont cause possible conflicts with module code
@parametrize('factory', ['orjson', 'simplejson', 'stdlib'])
def test_dumps(factory: str) -> None:
import pytest
# cant use a namedtuple here, since the default json.dump serializer
orig_dumps = globals()['dumps'] # hack to prevent error from using local variable before declaring
def dumps(*args, **kwargs) -> str:
kwargs['_prefer_factory'] = factory
return orig_dumps(*args, **kwargs)
import json as json_builtin # dont cause possible conflicts with module code
# can't use a namedtuple here, since the default json.dump serializer
# serializes namedtuples as tuples, which become arrays
# just test with an array of mixed objects
X = [5, datetime.timedelta(seconds=5.0)]
@ -166,36 +206,12 @@ def test_serialize_fallback() -> None:
# the lru_cache'd warning may have already been sent,
# so checking may be nondeterministic?
import warnings
with warnings.catch_warnings():
warnings.simplefilter("ignore")
res = jsn.loads(dumps(X))
res = json_builtin.loads(dumps(X))
assert res == [5, 5.0]
# this needs to be defined here to prevent a mypy bug
# see https://github.com/python/mypy/issues/7281
class _A(NamedTuple):
x: int
y: float
def test_nt_serialize() -> None:
import json as jsn # dont cause possible conflicts with module code
import orjson # import to make sure this is installed
res: str = dumps(_A(x=1, y=2.0))
assert res == '{"x":1,"y":2.0}'
# test orjson option kwarg
data = {datetime.date(year=1970, month=1, day=1): 5}
res = jsn.loads(dumps(data, option=orjson.OPT_NON_STR_KEYS))
assert res == {'1970-01-01': 5}
def test_default_serializer() -> None:
import pytest
import json as jsn # dont cause possible conflicts with module code
class Unserializable:
def __init__(self, x: int):
self.x = x
@ -209,17 +225,37 @@ def test_default_serializer() -> None:
def _serialize(self) -> Any:
return {"x": self.x, "y": self.y}
res = jsn.loads(dumps(WithUnderscoreSerialize(6)))
res = json_builtin.loads(dumps(WithUnderscoreSerialize(6)))
assert res == {"x": 6, "y": 6.0}
# test passing additional 'default' func
def _serialize_with_default(o: Any) -> Any:
if isinstance(o, Unserializable):
return {"x": o.x, "y": o.y}
raise TypeError("Couldnt serialize")
raise TypeError("Couldn't serialize")
# this serializes both Unserializable, which is a custom type otherwise
# not handled, and timedelta, which is handled by the '_default_encode'
# in the 'wrapped_default' function
res2 = jsn.loads(dumps(Unserializable(10), default=_serialize_with_default))
res2 = json_builtin.loads(dumps(Unserializable(10), default=_serialize_with_default))
assert res2 == {"x": 10, "y": 10.0}
if factory == 'orjson':
import orjson
# test orjson option kwarg
data = {datetime.date(year=1970, month=1, day=1): 5}
res2 = json_builtin.loads(dumps(data, option=orjson.OPT_NON_STR_KEYS))
assert res2 == {'1970-01-01': 5}
@parametrize('factory', ['orjson', 'simplejson'])
def test_dumps_namedtuple(factory: str) -> None:
import json as json_builtin # dont cause possible conflicts with module code
class _A(NamedTuple):
x: int
y: float
res: str = dumps(_A(x=1, y=2.0), _prefer_factory=factory)
assert json_builtin.loads(res) == {'x': 1, 'y': 2.0}
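Outside the tests, typical usage is unchanged; a small sketch of the built-in fallback encoder plus a user-supplied default (the Point class is made up):

import datetime
import json

from my.core.serialize import dumps

# timedelta isn't natively JSON-serializable; _default_encode turns it into seconds
assert json.loads(dumps({'elapsed': datetime.timedelta(seconds=5)})) == {'elapsed': 5.0}

class Point:
    def __init__(self, x: int) -> None:
        self.x = x

def encode_point(obj):
    if isinstance(obj, Point):
        return {'x': obj.x}
    raise TypeError('not a Point')  # anything else falls through to _default_encode

assert json.loads(dumps(Point(3), default=encode_point)) == {'x': 3}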


@ -3,9 +3,12 @@ Decorator to gracefully handle importing a data source, or warning
and yielding nothing (or a default) when it's not available
"""
from functools import wraps
from typing import Any, Iterator, TypeVar, Callable, Optional, Iterable
from __future__ import annotations
import warnings
from collections.abc import Iterable, Iterator
from functools import wraps
from typing import Any, Callable, TypeVar
from .warnings import medium
@ -26,8 +29,8 @@ _DEFAULT_ITR = ()
def import_source(
*,
default: Iterable[T] = _DEFAULT_ITR,
module_name: Optional[str] = None,
help_url: Optional[str] = None,
module_name: str | None = None,
help_url: str | None = None,
) -> Callable[..., Callable[..., Iterator[T]]]:
"""
doesn't really play well with types, but is used to catch
@ -50,6 +53,7 @@ def import_source(
except (ImportError, AttributeError) as err:
from . import core_config as CC
from .error import warn_my_config_import_error
suppressed_in_conf = False
if module_name is not None and CC.config._is_module_active(module_name) is False:
suppressed_in_conf = True
@ -61,16 +65,18 @@ def import_source(
warnings.warn(f"""If you don't want to use this module, to hide this message, add '{module_name}' to your core config disabled_modules in your config, like:
class core:
disabled_modules = [{repr(module_name)}]
""")
disabled_modules = [{module_name!r}]
""", stacklevel=1)
# try to check if this is a config error or based on dependencies not being installed
if isinstance(err, (ImportError, AttributeError)):
matched_config_err = warn_my_config_import_error(err, help_url=help_url)
matched_config_err = warn_my_config_import_error(err, module_name=module_name, help_url=help_url)
# if we determined this wasn't a config error, and it was an attribute error
# it could be *any* attribute error -- we should raise this since it's otherwise a fatal error
# from some code in the module failing
if not matched_config_err and isinstance(err, AttributeError):
raise err
yield from default
return wrapper
return decorator
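Roughly how this is meant to be used (the wrapped module name is hypothetical): if the inner import fails because a dependency or config is missing, the decorator warns and yields the default instead of raising.

from collections.abc import Iterator

from my.core.source import import_source

@import_source(module_name='my.hypothetical.source')
def events() -> Iterator[str]:
    from my.hypothetical.source import items  # may be uninstalled or unconfigured
    yield from items()

# events() then yields nothing (the default) and emits a warning if the import failed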


@ -1,16 +1,19 @@
from .common import assert_subpackage; assert_subpackage(__name__)
from __future__ import annotations
from .internal import assert_subpackage # noqa: I001
assert_subpackage(__name__)
from contextlib import contextmanager
from pathlib import Path
import shutil
import sqlite3
from collections.abc import Iterator
from contextlib import contextmanager
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Tuple, Any, Iterator, Callable, Optional, Union
from typing import Any, Callable, Literal, Union, overload
from .common import PathIsh, assert_never
from .compat import Literal
from .common import PathIsh
from .compat import assert_never
def sqlite_connect_immutable(db: PathIsh) -> sqlite3.Connection:
@ -22,7 +25,8 @@ def test_sqlite_connect_immutable(tmp_path: Path) -> None:
with sqlite3.connect(db) as conn:
conn.execute('CREATE TABLE testtable (col)')
import pytest # type: ignore
import pytest
with pytest.raises(sqlite3.OperationalError, match='readonly database'):
with sqlite_connect_immutable(db) as conn:
conn.execute('DROP TABLE testtable')
@ -34,15 +38,17 @@ def test_sqlite_connect_immutable(tmp_path: Path) -> None:
SqliteRowFactory = Callable[[sqlite3.Cursor, sqlite3.Row], Any]
def dict_factory(cursor, row):
fields = [column[0] for column in cursor.description]
return {key: value for key, value in zip(fields, row)}
return dict(zip(fields, row))
Factory = Union[SqliteRowFactory, Literal['row', 'dict']]
@contextmanager
def sqlite_connection(db: PathIsh, *, immutable: bool=False, row_factory: Optional[Factory]=None) -> Iterator[sqlite3.Connection]:
def sqlite_connection(db: PathIsh, *, immutable: bool = False, row_factory: Factory | None = None) -> Iterator[sqlite3.Connection]:
dbp = f'file:{db}'
# https://www.sqlite.org/draft/uri.html#uriimmutable
if immutable:
@ -86,45 +92,88 @@ def sqlite_copy_and_open(db: PathIsh) -> sqlite3.Connection:
for p in tocopy:
shutil.copy(p, tdir / p.name)
with sqlite3.connect(str(tdir / dp.name)) as conn:
from .compat import sqlite_backup
sqlite_backup(source=conn, dest=dest)
conn.backup(target=dest)
conn.close()
return dest
# NOTE hmm, so this kinda works
# V = TypeVar('V', bound=Tuple[Any, ...])
# def select(cols: V, rest: str, *, db: sqlite3.Connetion) -> Iterator[V]:
# def select(cols: V, rest: str, *, db: sqlite3.Connection) -> Iterator[V]:
# but sadly when we pass columns (Tuple[str, ...]), it seems to bind this type to V?
# and then the return type ends up as Iterator[Tuple[str, ...]], which isn't desirable :(
# a bit annoying to have this copy-pasting, but hopefully not a big issue
from typing import overload
# fmt: off
@overload
def select(cols: Tuple[str ], rest: str, *, db: sqlite3.Connection) -> \
Iterator[Tuple[Any ]]: ...
def select(cols: tuple[str ], rest: str, *, db: sqlite3.Connection) -> \
Iterator[tuple[Any ]]: ...
@overload
def select(cols: Tuple[str, str ], rest: str, *, db: sqlite3.Connection) -> \
Iterator[Tuple[Any, Any ]]: ...
def select(cols: tuple[str, str ], rest: str, *, db: sqlite3.Connection) -> \
Iterator[tuple[Any, Any ]]: ...
@overload
def select(cols: Tuple[str, str, str ], rest: str, *, db: sqlite3.Connection) -> \
Iterator[Tuple[Any, Any, Any ]]: ...
def select(cols: tuple[str, str, str ], rest: str, *, db: sqlite3.Connection) -> \
Iterator[tuple[Any, Any, Any ]]: ...
@overload
def select(cols: Tuple[str, str, str, str ], rest: str, *, db: sqlite3.Connection) -> \
Iterator[Tuple[Any, Any, Any, Any ]]: ...
def select(cols: tuple[str, str, str, str ], rest: str, *, db: sqlite3.Connection) -> \
Iterator[tuple[Any, Any, Any, Any ]]: ...
@overload
def select(cols: Tuple[str, str, str, str, str ], rest: str, *, db: sqlite3.Connection) -> \
Iterator[Tuple[Any, Any, Any, Any, Any ]]: ...
def select(cols: tuple[str, str, str, str, str ], rest: str, *, db: sqlite3.Connection) -> \
Iterator[tuple[Any, Any, Any, Any, Any ]]: ...
@overload
def select(cols: Tuple[str, str, str, str, str, str ], rest: str, *, db: sqlite3.Connection) -> \
Iterator[Tuple[Any, Any, Any, Any, Any, Any ]]: ...
def select(cols: tuple[str, str, str, str, str, str ], rest: str, *, db: sqlite3.Connection) -> \
Iterator[tuple[Any, Any, Any, Any, Any, Any ]]: ...
@overload
def select(cols: Tuple[str, str, str, str, str, str, str ], rest: str, *, db: sqlite3.Connection) -> \
Iterator[Tuple[Any, Any, Any, Any, Any, Any, Any ]]: ...
def select(cols: tuple[str, str, str, str, str, str, str ], rest: str, *, db: sqlite3.Connection) -> \
Iterator[tuple[Any, Any, Any, Any, Any, Any, Any ]]: ...
@overload
def select(cols: Tuple[str, str, str, str, str, str, str, str], rest: str, *, db: sqlite3.Connection) -> \
Iterator[Tuple[Any, Any, Any, Any, Any, Any, Any, Any]]: ...
def select(cols: tuple[str, str, str, str, str, str, str, str], rest: str, *, db: sqlite3.Connection) -> \
Iterator[tuple[Any, Any, Any, Any, Any, Any, Any, Any]]: ...
# fmt: on
def select(cols, rest, *, db):
# db arg is last cause that results in nicer code formatting..
return db.execute('SELECT ' + ','.join(cols) + ' ' + rest)
class SqliteTool:
def __init__(self, connection: sqlite3.Connection) -> None:
self.connection = connection
def _get_sqlite_master(self) -> dict[str, str]:
res = {}
for c in self.connection.execute('SELECT name, type FROM sqlite_master'):
[name, type_] = c
assert type_ in {'table', 'index', 'view', 'trigger'}, (name, type_) # just in case
res[name] = type_
return res
def get_table_names(self) -> list[str]:
master = self._get_sqlite_master()
res = []
for name, type_ in master.items():
if type_ != 'table':
continue
res.append(name)
return res
def get_table_schema(self, name: str) -> dict[str, str]:
"""
Returns map from column name to column type
NOTE: Sometimes this doesn't work if the db has some extensions (e.g. happens for facebook apps)
In this case you might still be able to use get_table_names
"""
schema: dict[str, str] = {}
for row in self.connection.execute(f'PRAGMA table_info(`{name}`)'):
col = row[1]
type_ = row[2]
# hmm, somewhere between 3.34.1 and 3.37.2, sqlite started normalising type names to uppercase
# let's do this just in case since python < 3.10 are using the old version
# e.g. it could have returned 'blob' and that would confuse blob check (see _check_allowed_blobs)
type_ = type_.upper()
schema[col] = type_
return schema
def get_table_schemas(self) -> dict[str, dict[str, str]]:
return {name: self.get_table_schema(name) for name in self.get_table_names()}
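Putting the helpers together, a sketch of poking at an exported database (the path and table names are placeholders):

from my.core.sqlite import SqliteTool, sqlite_connection

with sqlite_connection('/path/to/export.db', immutable=True, row_factory='dict') as conn:
    tool = SqliteTool(conn)
    print(tool.get_table_names())             # e.g. ['messages', 'users']
    print(tool.get_table_schema('messages'))  # column name -> (uppercased) column type
    rows = conn.execute('SELECT * FROM messages').fetchall()  # dicts, thanks to row_factory='dict'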


@ -1,41 +1,219 @@
'''
Helpers for hpi doctor/stats functionality.
'''
import collections
from __future__ import annotations
import collections.abc
import importlib
import inspect
import sys
import typing
from typing import Optional, Callable, Any, Iterator, Sequence, Dict, List
from collections.abc import Iterable, Iterator, Sequence
from contextlib import contextmanager
from datetime import datetime
from pathlib import Path
from types import ModuleType
from typing import (
Any,
Callable,
Protocol,
cast,
)
from .common import StatsFun, Stats, stat
from .types import asdict
Stats = dict[str, Any]
class StatsFun(Protocol):
def __call__(self, *, quick: bool = False) -> Stats: ...
# global state that turns on/off quick stats
# can use the 'quick_stats' contextmanager
# to enable/disable this in cli so that module 'stats'
# functions don't have to implement custom 'quick' logic
QUICK_STATS = False
# in case user wants to use the stats functions/quick option
# elsewhere -- can use this decorator instead of editing
# the global state directly
@contextmanager
def quick_stats():
global QUICK_STATS
prev = QUICK_STATS
try:
QUICK_STATS = True
yield
finally:
QUICK_STATS = prev
def stat(
func: Callable[[], Iterable[Any]] | Iterable[Any],
*,
quick: bool = False,
name: str | None = None,
) -> Stats:
"""
Extracts various statistics from a passed iterable/callable, e.g.:
- number of items
- first/last item
- timestamps associated with first/last item
If quick is set, then only first 100 items of the iterable will be processed
"""
if callable(func):
fr = func()
if hasattr(fr, '__enter__') and hasattr(fr, '__exit__'):
# context managers have Iterable type, but they aren't data providers
# sadly doesn't look like there is a way to tell from typing annotations
# Ideally we'd detect this in is_data_provider...
# but there is no way of knowing without actually calling it first :(
return {}
fname = func.__name__
else:
# meh. means it's just a list.. not sure how to generate a name then
fr = func
fname = f'unnamed_{id(fr)}'
type_name = type(fr).__name__
extras = {}
if type_name == 'DataFrame':
# dynamic, because pandas is an optional dependency..
df = cast(Any, fr) # todo ugh, not sure how to annotate properly
df = df.reset_index()
fr = df.to_dict(orient='records')
dtypes = df.dtypes.to_dict()
extras['dtypes'] = dtypes
res = _stat_iterable(fr, quick=quick)
res.update(extras)
stat_name = name if name is not None else fname
return {
stat_name: res,
}
def test_stat() -> None:
# the bulk of testing is in test_stat_iterable
# works with 'anonymous' lists
res = stat([1, 2, 3])
[(name, v)] = res.items()
# note: name will be a little funny since anonymous list doesn't have one
assert v == {'count': 3}
#
# works with functions:
def fun():
return [4, 5, 6]
assert stat(fun) == {'fun': {'count': 3}}
#
# context managers are technically iterable
# , but usually we wouldn't want to compute stats for them
# this is mainly intended for guess_stats,
# since it can't tell whether the function is a ctx manager without calling it
@contextmanager
def cm():
yield 1
yield 3
assert stat(cm) == {} # type: ignore[arg-type]
#
# works with pandas dataframes
import numpy as np
import pandas as pd
def df() -> pd.DataFrame:
dates = pd.date_range(start='2024-02-10 08:00', end='2024-02-11 16:00', freq='5h')
return pd.DataFrame([f'value{i}' for i, _ in enumerate(dates)], index=dates, columns=['value'])
assert stat(df) == {
'df': {
'count': 7,
'dtypes': {
'index': np.dtype('<M8[ns]'),
'value': np.dtype('O'),
},
'first': pd.Timestamp('2024-02-10 08:00'),
'last': pd.Timestamp('2024-02-11 14:00'),
},
}
#
def get_stats(module_name: str, *, guess: bool = False) -> StatsFun | None:
stats: StatsFun | None = None
try:
module = importlib.import_module(module_name)
except Exception:
return None
stats = getattr(module, 'stats', None)
if stats is None:
stats = guess_stats(module)
return stats
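A sketch of how a caller (e.g. the doctor/CLI code) might combine these; my.core.tests.auto_stats is the same helper module used by test_guess_stats below, and presumably has no hand-written stats(), so get_stats falls back to guess_stats:

from my.core.stats import get_stats, quick_stats

stats_fun = get_stats('my.core.tests.auto_stats')
assert stats_fun is not None

with quick_stats():             # lets modules with custom stats() honour the global quick flag
    res = stats_fun(quick=True)
print(res)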
# TODO maybe could be enough to annotate OUTPUTS or something like that?
# then stats could just use them as hints?
def guess_stats(module_name: str, quick: bool=False) -> Optional[StatsFun]:
providers = guess_data_providers(module_name)
def guess_stats(module: ModuleType) -> StatsFun | None:
"""
If the module doesn't have explicitly defined 'stat' function,
this is used to try to guess what could be included in stats automatically
"""
providers = _guess_data_providers(module)
if len(providers) == 0:
return None
def auto_stats() -> Stats:
return {k: stat(v, quick=quick) for k, v in providers.items()}
def auto_stats(*, quick: bool = False) -> Stats:
res = {}
for k, v in providers.items():
res.update(stat(v, quick=quick, name=k))
return res
return auto_stats
def guess_data_providers(module_name: str) -> Dict[str, Callable]:
module = importlib.import_module(module_name)
def test_guess_stats() -> None:
import my.core.tests.auto_stats as M
auto_stats = guess_stats(M)
assert auto_stats is not None
res = auto_stats(quick=False)
assert res == {
'inputs': {
'count': 3,
'first': 'file1.json',
'last': 'file3.json',
},
'iter_data': {
'count': 9,
'first': datetime(2020, 1, 1, 1, 1, 1),
'last': datetime(2020, 1, 3, 1, 1, 1),
},
}
def _guess_data_providers(module: ModuleType) -> dict[str, Callable]:
mfunctions = inspect.getmembers(module, inspect.isfunction)
return {k: v for k, v in mfunctions if is_data_provider(v)}
# todo how to exclude deprecated stuff?
# todo how to exclude deprecated data providers?
def is_data_provider(fun: Any) -> bool:
"""
Criteria for being a "data provider":
1. returns iterable or something like that
2. takes no arguments? (otherwise not callable by stats anyway?)
3. doesn't start with an underscore (those are probably helper functions?)
4. functions isnt the 'inputs' function (or ends with '_inputs')
"""
# todo maybe for 2 allow default arguments? not sure
# one example which could benefit is my.pdfs
@ -48,19 +226,23 @@ def is_data_provider(fun: Any) -> bool:
return False
# has at least one argument without default values
if len(list(sig_required_params(sig))) > 0:
if len(list(_sig_required_params(sig))) > 0:
return False
if hasattr(fun, '__name__'):
# probably a helper function?
if fun.__name__.startswith('_'):
return False
# ignore def inputs; something like comment_inputs or backup_inputs should also be ignored
if fun.__name__ == 'inputs' or fun.__name__.endswith('_inputs'):
# inspect.signature might return str instead of a proper type object
# if from __future__ import annotations is used
# so best to rely on get_type_hints (which evals the annotations)
type_hints = typing.get_type_hints(fun)
return_type = type_hints.get('return')
if return_type is None:
return False
return_type = sig.return_annotation
return type_is_iterable(return_type)
return _type_is_iterable(return_type)
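To illustrate the comment above, a small standalone example of the difference (not part of the module):

from __future__ import annotations  # postpones evaluation: annotations become plain strings

import inspect
import typing
from collections.abc import Iterator

def provider() -> Iterator[int]:
    yield 1

print(inspect.signature(provider).return_annotation)  # 'Iterator[int]' -- just a str
print(typing.get_type_hints(provider)['return'])      # collections.abc.Iterator[int] -- a real type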
def test_is_data_provider() -> None:
@@ -71,34 +253,42 @@ def test_is_data_provider() -> None:
def no_return_type():
return [1, 2, 3]
assert not idp(no_return_type)
lam = lambda: [1, 2]
assert not idp(lam)
def has_extra_args(count) -> List[int]:
def has_extra_args(count) -> list[int]:
return list(range(count))
assert not idp(has_extra_args)
def has_return_type() -> Sequence[str]:
return ['a', 'b', 'c']
assert idp(has_return_type)
def _helper_func() -> Iterator[Any]:
yield 1
assert not idp(_helper_func)
def inputs() -> Iterator[Any]:
yield 1
assert not idp(inputs)
assert idp(inputs)
def producer_inputs() -> Iterator[Any]:
yield 1
assert not idp(producer_inputs)
assert idp(producer_inputs)
# return any parameters the user is required to provide - those which don't have default values
def sig_required_params(sig: inspect.Signature) -> Iterator[inspect.Parameter]:
def _sig_required_params(sig: inspect.Signature) -> Iterator[inspect.Parameter]:
"""
Returns parameters the user is required to provide - e.g. ones that don't have default values
"""
for param in sig.parameters.values():
if param.default == inspect.Parameter.empty:
yield param
@@ -108,24 +298,24 @@ def test_sig_required_params() -> None:
def x() -> int:
return 5
assert len(list(sig_required_params(inspect.signature(x)))) == 0
assert len(list(_sig_required_params(inspect.signature(x)))) == 0
def y(arg: int) -> int:
return arg
assert len(list(sig_required_params(inspect.signature(y)))) == 1
assert len(list(_sig_required_params(inspect.signature(y)))) == 1
# from stats perspective, this should be treated as a data provider as well
# could be that the default value to the data provider is the 'default'
# path to use for inputs/a function to provide input data
def z(arg: int = 5) -> int:
return arg
assert len(list(sig_required_params(inspect.signature(z)))) == 0
assert len(list(_sig_required_params(inspect.signature(z)))) == 0
def type_is_iterable(type_spec) -> bool:
if sys.version_info[1] < 8:
# there is no get_origin before 3.8, and retrofitting it is going to be a lot of pain
return any(x in str(type_spec) for x in ['List', 'Sequence', 'Iterable', 'Iterator'])
def _type_is_iterable(type_spec) -> bool:
origin = typing.get_origin(type_spec)
if origin is None:
return False
@@ -142,14 +332,139 @@ def type_is_iterable(type_spec) -> bool:
# todo docstring test?
def test_type_is_iterable() -> None:
from typing import List, Sequence, Iterable, Dict, Any
fun = type_is_iterable
fun = _type_is_iterable
assert not fun(None)
assert not fun(int)
assert not fun(Any)
assert not fun(Dict[int, int])
assert not fun(dict[int, int])
assert fun(List[int])
assert fun(Sequence[Dict[str, str]])
assert fun(list[int])
assert fun(Sequence[dict[str, str]])
assert fun(Iterable[Any])
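For reference, what typing.get_origin returns for a few of these inputs (standalone snippet):

import typing
from collections.abc import Sequence

print(typing.get_origin(list[int]))      # <class 'list'>
print(typing.get_origin(Sequence[str]))  # <class 'collections.abc.Sequence'>
print(typing.get_origin(int))            # None, so it isn't treated as iterable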
def _stat_item(item):
if item is None:
return None
if isinstance(item, Path):
return str(item)
return _guess_datetime(item)
def _stat_iterable(it: Iterable[Any], *, quick: bool = False) -> Stats:
from more_itertools import first, ilen, take
# todo not sure if there is something in more_itertools to compute this?
total = 0
errors = 0
first_item = None
last_item = None
def funcit():
nonlocal errors, first_item, last_item, total
for x in it:
total += 1
if isinstance(x, Exception):
errors += 1
else:
last_item = x
if first_item is None:
first_item = x
yield x
eit = funcit()
count: Any
if quick or QUICK_STATS:
initial = take(100, eit)
count = len(initial)
if first(eit, None) is not None: # todo can actually be none...
# haven't exhausted
count = f'{count}+'
else:
count = ilen(eit)
res = {
'count': count,
}
if total == 0:
# not sure but I guess a good balance? wouldn't want to throw early here?
res['warning'] = 'THE ITERABLE RETURNED NO DATA'
if errors > 0:
res['errors'] = errors
if (stat_first := _stat_item(first_item)) is not None:
res['first'] = stat_first
if (stat_last := _stat_item(last_item)) is not None:
res['last'] = stat_last
return res
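The quick path above only materialises the first 100 items and reports a lower bound if the iterator isn't exhausted; roughly, the idea is (standalone sketch, function name made up):

from more_itertools import first, take

def approx_count(it, limit: int = 100):
    head = take(limit, it)           # consume at most `limit` items
    count = len(head)
    if first(it, None) is not None:  # something left over, so report a lower bound
        return f'{count}+'
    return count

print(approx_count(iter(range(10))))    # 10
print(approx_count(iter(range(1000))))  # '100+'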
def test_stat_iterable() -> None:
from datetime import datetime, timedelta, timezone
from typing import NamedTuple
dd = datetime.fromtimestamp(123, tz=timezone.utc)
day = timedelta(days=3)
class X(NamedTuple):
x: int
d: datetime
def it():
yield RuntimeError('oops!')
for i in range(2):
yield X(x=i, d=dd + day * i)
yield RuntimeError('bad!')
for i in range(3):
yield X(x=i * 10, d=dd + day * (i * 10))
yield X(x=123, d=dd + day * 50)
res = _stat_iterable(it())
assert res['count'] == 1 + 2 + 1 + 3 + 1
assert res['errors'] == 1 + 1
assert res['last'] == dd + day * 50
# experimental, not sure about it..
def _guess_datetime(x: Any) -> datetime | None:
# todo hmm implement without exception..
try:
d = asdict(x)
except: # noqa: E722 bare except
return None
for v in d.values():
if isinstance(v, datetime):
return v
return None
def test_guess_datetime() -> None:
from dataclasses import dataclass
from typing import NamedTuple
from .compat import fromisoformat
dd = fromisoformat('2021-02-01T12:34:56Z')
class A(NamedTuple):
x: int
class B(NamedTuple):
x: int
created: datetime
assert _guess_datetime(A(x=4)) is None
assert _guess_datetime(B(x=4, created=dd)) == dd
@dataclass
class C:
a: datetime
x: int
assert _guess_datetime(C(a=dd, x=435)) == dd
# TODO not sure what to return when multiple datetime fields?
# TODO test @property?

my/core/structure.py

@@ -1,20 +1,22 @@
from __future__ import annotations
import atexit
import os
import shutil
import sys
import tarfile
import tempfile
import zipfile
import atexit
from typing import Sequence, Generator, List, Union, Tuple
from collections.abc import Generator, Sequence
from contextlib import contextmanager
from pathlib import Path
from .common import LazyLogger
from .logging import make_logger
logger = make_logger(__name__, level="info")
logger = LazyLogger(__name__, level="info")
def _structure_exists(base_dir: Path, paths: Sequence[str], partial: bool = False) -> bool:
def _structure_exists(base_dir: Path, paths: Sequence[str], *, partial: bool = False) -> bool:
"""
Helper function for match_structure to check if
all subpaths exist at some base directory
@@ -36,17 +38,18 @@ def _structure_exists(base_dir: Path, paths: Sequence[str], partial: bool = Fals
ZIP_EXT = {".zip"}
TARGZ_EXT = {".tar.gz"}
@contextmanager
def match_structure(
base: Path,
expected: Union[str, Sequence[str]],
expected: str | Sequence[str],
*,
partial: bool = False,
) -> Generator[Tuple[Path, ...], None, None]:
) -> Generator[tuple[Path, ...], None, None]:
"""
Given a 'base' directory or zipfile, recursively search for one or more paths that match the
Given a 'base' directory or archive (zip/tar.gz), recursively search for one or more paths that match the
pattern described in 'expected'. That can be a single string, or a list
of relative paths (as strings) you expect at the same directory.
@@ -54,12 +57,12 @@ def match_structure(
expected be present, not all of them.
This reduces the chances of the user misconfiguring gdpr exports, e.g.
if they zipped the folders instead of the parent directory or vice-versa
if they archived the folders instead of the parent directory or vice-versa
When this finds a matching directory structure, it stops searching in that subdirectory
and continues onto other possible subdirectories which could match
If base is a zipfile, this extracts the zipfile into a temporary directory
If base is an archive, this extracts it into a temporary directory
(configured by core_config.config.get_tmp_dir), and then searches the extracted
folder for matching structures
@@ -69,21 +72,21 @@ def match_structure(
export_dir
├── exp_2020
│   ├── channel_data
│   │   ├── data1
│   │   └── data2
│   ├── index.json
│   ├── messages
│   │   └── messages.csv
│   └── profile
│       └── settings.json
└── exp_2021
    ├── channel_data
    │   ├── data1
    │   └── data2
    ├── index.json
    ├── messages
    │   └── messages.csv
    └── profile
        └── settings.json
@@ -95,12 +98,12 @@ def match_structure(
This doesn't require an exhaustive list of expected values, but it's a good idea to supply
a complete picture of the expected structure to avoid false-positives
This does not recursively unzip zipfiles in the subdirectories,
it only unzips into a temporary directory if 'base' is a zipfile
This does not recursively decompress archives in the subdirectories,
it only unpacks into a temporary directory if 'base' is an archive
A common pattern for using this might be to use get_files to get a list
of zipfiles or top-level gdpr export directories, and use match_structure
to search the resulting paths for a export structure you're expecting
of archives or top-level gdpr export directories, and use match_structure
to search the resulting paths for an export structure you're expecting
"""
from . import core_config as CC
@@ -110,28 +113,37 @@ def match_structure(
expected = (expected,)
is_zip: bool = base.suffix in ZIP_EXT
is_targz: bool = any(base.name.endswith(suffix) for suffix in TARGZ_EXT)
searchdir: Path = base.absolute()
try:
# if the file given by the user is a zipfile, create a temporary
# directory and extract the zipfile to that temporary directory
# if the file given by the user is an archive, create a temporary
# directory and extract it to that temporary directory
#
# this temporary directory is removed in the finally block
if is_zip:
if is_zip or is_targz:
# sanity check before we start creating directories/rm-tree'ing things
assert base.exists(), f"zipfile at {base} doesn't exist"
assert base.exists(), f"archive at {base} doesn't exist"
searchdir = Path(tempfile.mkdtemp(dir=tdir))
zf = zipfile.ZipFile(base)
if is_zip:
# base might already be a ZipPath, and str(base) would end with /
zf = zipfile.ZipFile(str(base).rstrip('/'))
zf.extractall(path=str(searchdir))
elif is_targz:
with tarfile.open(str(base)) as tar:
# filter is a security feature; it will be a required param in later python versions
mfilter = {'filter': 'data'} if sys.version_info[:2] >= (3, 12) else {}
tar.extractall(path=str(searchdir), **mfilter) # type: ignore[arg-type]
else:
raise RuntimeError("can't happen")
else:
if not searchdir.is_dir():
raise NotADirectoryError(f"Expected either a zipfile or a directory, received {searchdir}")
raise NotADirectoryError(f"Expected either a zip/tar.gz archive or a directory, received {searchdir}")
matches: List[Path] = []
possible_targets: List[Path] = [searchdir]
matches: list[Path] = []
possible_targets: list[Path] = [searchdir]
while len(possible_targets) > 0:
p = possible_targets.pop(0)
@@ -151,9 +163,9 @@ def match_structure(
finally:
if is_zip:
if is_zip or is_targz:
# make sure we're not mistakenly deleting data
assert str(searchdir).startswith(str(tdir)), f"Expected the temporary directory for extracting zip to start with the temporary directory prefix ({tdir}), found {searchdir}"
assert str(searchdir).startswith(str(tdir)), f"Expected the temporary directory for extracting archive to start with the temporary directory prefix ({tdir}), found {searchdir}"
shutil.rmtree(str(searchdir))
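As a usage sketch (the archive path and expected entries below are made up):

from pathlib import Path
from my.core.structure import match_structure

export = Path('~/data/some_export.zip').expanduser()  # hypothetical archive
with match_structure(export, expected=('messages', 'profile'), partial=False) as matches:
    # matches is a tuple of directories whose layout matches 'expected';
    # any temporary extraction directory is removed when the block exits
    for m in matches:
        print(m / 'messages')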
@@ -162,7 +174,7 @@ def warn_leftover_files() -> None:
from . import core_config as CC
base_tmp: Path = CC.config.get_tmp_dir()
leftover: List[Path] = list(base_tmp.iterdir())
leftover: list[Path] = list(base_tmp.iterdir())
if leftover:
logger.debug(f"at exit warning: Found leftover files in temporary directory '{leftover}'. this may be because you have multiple hpi processes running -- if so this can be ignored")

my/core/tests/__init__.py

@@ -0,0 +1,3 @@
# hmm, sadly pytest --import-mode importlib --pyargs my.core.tests doesn't work properly without __init__.py
# although it works if you run either my.core or my.core.tests.sqlite (for example) directly
# so if it gets in the way could get rid of this later?

my/core/tests/auto_stats.py

@@ -0,0 +1,37 @@
"""
Helper 'module' for test_guess_stats
"""
from collections.abc import Iterable, Iterator, Sequence
from contextlib import contextmanager
from dataclasses import dataclass
from datetime import datetime, timedelta
from pathlib import Path
@dataclass
class Item:
id: str
dt: datetime
source: Path
def inputs() -> Sequence[Path]:
return [
Path('file1.json'),
Path('file2.json'),
Path('file3.json'),
]
def iter_data() -> Iterable[Item]:
dt = datetime.fromisoformat('2020-01-01 01:01:01')
for path in inputs():
for i in range(3):
yield Item(id=str(i), dt=dt + timedelta(days=i), source=path)
@contextmanager
def some_contextmanager() -> Iterator[str]:
# this shouldn't end up in guess_stats because context manager is not a data provider
yield 'hello'

my/core/tests/common.py

@@ -0,0 +1,32 @@
from __future__ import annotations
import os
from collections.abc import Iterator
from contextlib import contextmanager
import pytest
V = 'HPI_TESTS_USES_OPTIONAL_DEPS'
# TODO use it for serialize tests that are using simplejson/orjson?
skip_if_uses_optional_deps = pytest.mark.skipif(
V not in os.environ,
reason=f'test only works when optional dependencies are installed. Set env variable {V}=true to override.',
)
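For reference, a sketch of how the marker would typically be applied; the test body below is made up:

@skip_if_uses_optional_deps
def test_orjson_roundtrip() -> None:
    # only runs when HPI_TESTS_USES_OPTIONAL_DEPS is set in the environment
    import orjson  # optional dependency
    assert orjson.dumps({'a': 1}) == b'{"a":1}'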
# TODO maybe move to hpi core?
@contextmanager
def tmp_environ_set(key: str, value: str | None) -> Iterator[None]:
prev_value = os.environ.get(key)
if value is None:
os.environ.pop(key, None)
else:
os.environ[key] = value
try:
yield
finally:
if prev_value is None:
os.environ.pop(key, None)
else:
os.environ[key] = prev_value
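A brief usage sketch (the variable name is illustrative):

import os

with tmp_environ_set('SOME_VAR', 'value'):
    assert os.environ['SOME_VAR'] == 'value'  # visible inside the block
# on exit the previous value (or absence) of SOME_VAR is restored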
