my.reddit: refactor into module that supports pushshift/gdpr (#179)

* initial pushshift/rexport merge implementation, using id for merging
* smarter module deprecation warning using regex
* add `RedditBase` from promnesia
* `import_source` helper for gracefully handing mixin data sources
This commit is contained in:
Sean Breckenridge 2021-10-31 13:39:04 -07:00 committed by GitHub
parent b54ec0d7f1
commit 8422c6e420
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
15 changed files with 374 additions and 58 deletions

View file

@ -74,7 +74,6 @@ import importlib
modules = [
('google' , 'my.google.takeout.paths'),
('hypothesis' , 'my.hypothesis' ),
('reddit' , 'my.reddit' ),
('pocket' , 'my.pocket' ),
('twint' , 'my.twitter.twint' ),
('twitter_archive', 'my.twitter.archive' ),
@ -144,14 +143,25 @@ for cls, p in modules:
Reddit data: saved items/comments/upvotes/etc.
# Note: can't be generated as easily since this is a nested configuration object
#+begin_src python
class reddit:
'''
Uses [[https://github.com/karlicoss/rexport][rexport]] output.
'''
class rexport:
'''
Uses [[https://github.com/karlicoss/rexport][rexport]] output.
'''
# path[s]/glob to the exported JSON data
export_path: Paths
class pushshift:
'''
Uses [[https://github.com/seanbreckenridge/pushshift_comment_export][pushshift]] to get access to old comments
'''
# path[s]/glob to the exported JSON data
export_path: Paths
# path[s]/glob to the exported JSON data
export_path: Paths
#+end_src
** [[file:../my/pocket.py][my.pocket]]

View file

@ -76,11 +76,11 @@ A related concern is how to structure namespace packages to allow users to easil
- In addition, you can *override* the builtin HPI modules too:
: custom_reddit_overlay
: custom_lastfm_overlay
: └── my
: └──reddit.py
: └──lastfm.py
Now if you add =custom_reddit_overlay= *in front* of ~PYTHONPATH~, all the downstream scripts using =my.reddit= will load it from =custom_reddit_overlay= instead.
Now if you add =custom_lastfm_overlay= [[https://docs.python.org/3/using/cmdline.html#envvar-PYTHONPATH][*in front* of ~PYTHONPATH~]], all the downstream scripts using =my.lastfm= will load it from =custom_lastfm_overlay= instead.
This could be useful to monkey patch some behaviours, or dynamically add some extra data sources -- anything that comes to your mind.
You can check [[https://github.com/karlicoss/hpi-personal-overlay/blob/7fca8b1b6031bf418078da2d8be70fd81d2d8fa0/src/my/calendar/holidays.py#L1-L14][my.calendar.holidays]] in my personal overlay as a reference.
@ -99,15 +99,15 @@ In order to do that, like stated above, you could edit the ~PYTHONPATH~ variable
In the context of HPI, it being a namespace package means you can have a local clone of this repository, and your own 'HPI' modules in a separate folder, which then get combined into the ~my~ package.
As an example, say you were trying to override the ~my.reddit~ file, to include some new feature. You could create a new file hierarchy like:
As an example, say you were trying to override the ~my.lastfm~ file, to include some new feature. You could create a new file hierarchy like:
: .
: ├── my
: │   ├── reddit.py
: │   ├── lastfm.py
: │   └── some_new_module.py
: └── setup.py
Where ~reddit.py~ is your version of ~my.reddit~, which you've copied from this repository and applied your changes to. The ~setup.py~ would be something like:
Where ~lastfm.py~ is your version of ~my.lastfm~, which you've copied from this repository and applied your changes to. The ~setup.py~ would be something like:
#+begin_src python
from setuptools import setup, find_namespace_packages
@ -121,9 +121,9 @@ Where ~reddit.py~ is your version of ~my.reddit~, which you've copied from this
)
#+end_src
Then, running ~pip3 install -e .~ in that directory would install that as part of the namespace package, and assuming (see below for possible issues) this appears on ~sys.path~ before the upstream repository, your ~reddit.py~ file overrides the upstream. Adding more files, like ~my.some_new_module~ into that directory immediately updates the global ~my~ package -- allowing you to quickly add new modules without having to re-install.
Then, running ~python3 -m pip install -e .~ in that directory would install that as part of the namespace package, and assuming (see below for possible issues) this appears on ~sys.path~ before the upstream repository, your ~lastfm.py~ file overrides the upstream. Adding more files, like ~my.some_new_module~ into that directory immediately updates the global ~my~ package -- allowing you to quickly add new modules without having to re-install.
If you install both directories as editable packages (which has the benefit of any changes you making in either repository immediately updating the globally installed ~my~ package), there are some concerns with which editable install appears on your ~sys.path~ first. If you wanted your modules to override the upstream modules, yours would have to appear on the ~sys.path~ first (this is the same reason that =custom_reddit_overlay= must be at the front of your ~PYTHONPATH~). For more details and examples on dealing with editable namespace packages in the context of HPI, see the [[https://github.com/seanbreckenridge/reorder_editable][reorder_editable]] repository.
If you install both directories as editable packages (which has the benefit of any changes you making in either repository immediately updating the globally installed ~my~ package), there are some concerns with which editable install appears on your ~sys.path~ first. If you wanted your modules to override the upstream modules, yours would have to appear on the ~sys.path~ first (this is the same reason that =custom_lastfm_overlay= must be at the front of your ~PYTHONPATH~). For more details and examples on dealing with editable namespace packages in the context of HPI, see the [[https://github.com/seanbreckenridge/reorder_editable][reorder_editable]] repository.
There is no limit to how many directories you could install into a single namespace package, which could be a possible way for people to install additional HPI modules, without worrying about the module count here becoming too large to manage.

View file

@ -355,7 +355,7 @@ The only thing you need to do is to tell it where to find the files on your disk
Reddit has a proper API, so in theory HPI could talk directly to Reddit and retrieve the latest data. But that's not what it doing!
- first, there are excellent programmatic APIs for Reddit out there already, for example, [[https://github.com/praw-dev/praw][praw]]
- more importantly, this is the [[https://beepb00p.xyz/exports.html#design][design decision]] of HP
- more importantly, this is the [[https://beepb00p.xyz/exports.html#design][design decision]] of HPI
It doesn't deal with all with the complexities of API interactions.
Instead, it relies on other tools to put *intermediate, raw data*, on your disk and then transforms this data into something nice.
@ -368,19 +368,18 @@ As an example, for [[file:../my/reddit.py][Reddit]], HPI is relying on data fetc
: ⇓⇓⇓
: |💾 /backups/reddit/*.json |
: ⇓⇓⇓
: HPI (my.reddit)
: HPI (my.reddit.rexport)
: ⇓⇓⇓
: < python interface >
So, in your [[file:MODULES.org::#myreddit][reddit config]], similarly to Takeout, you need =export_path=, so HPI knows how to find your Reddit data on the disk.
But there is an extra caveat: rexport is already coming with nice [[https://github.com/karlicoss/rexport/blob/master/dal.py][data bindings]] to parse its outputs.
Another *design decision* of HPI is to use existing code and libraries as much as possible, so we also specify a path to =rexport= repository in the config.
(note: in the future it's possible that rexport will be installed via PIP, I just haven't had time for it so far).
Several other HPI modules are following a similar pattern: hypothesis, instapaper, pinboard, kobo, etc.
Since the [[https://github.com/karlicoss/rexport#api-limitations][reddit API has limited results]], you can use [[https://github.com/seanbreckenridge/pushshift_comment_export][my.reddit.pushshift]] to access older reddit comments, which both then get merged into =my.reddit.all.comments=
** Twitter
Twitter is interesting, because it's an example of an HPI module that *arbitrates* between several data sources from the same service.