my.reddit: refactor into module that supports pushshift/gdpr (#179)

* initial pushshift/rexport merge implementation, using id for merging
* smarter module deprecation warning using regex
* add `RedditBase` from promnesia
* `import_source` helper for gracefully handling mixin data sources
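
The "using id for merging" step can be illustrated with a minimal sketch. This is not the code from this commit: the `Comment` record and the `merge_by_id` function are hypothetical names chosen for illustration, and the sketch assumes only that every item from both sources carries a stable `id`.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Set


@dataclass(frozen=True)
class Comment:
    # Hypothetical minimal record; the real modules expose richer objects.
    id: str
    text: str


def merge_by_id(*sources: Iterable[Comment]) -> Iterator[Comment]:
    """Yield items from each source in order, skipping ids already seen."""
    seen: Set[str] = set()
    for source in sources:
        for item in source:
            if item.id in seen:
                continue  # the same comment was already emitted by an earlier source
            seen.add(item.id)
            yield item


# Made-up sample data: one comment appears in both exports.
rexport_comments = [Comment("a1", "recent"), Comment("b2", "also recent")]
pushshift_comments = [Comment("b2", "also recent"), Comment("c3", "old")]

merged = list(merge_by_id(rexport_comments, pushshift_comments))
```

In this sketch, sources listed first win when an id appears more than once, so the order of the arguments decides which copy of a duplicated comment is kept.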
2021-10-31 13:39:04 -07:00
commit 8422c6e420
parent b54ec0d7f1
15 changed files with 374 additions and 58 deletions
--- a/doc/MODULES.org
+++ b/doc/MODULES.org
@@ -74,7 +74,6 @@ import importlib
 modules = [
    ('google'         , 'my.google.takeout.paths'),
    ('hypothesis'     , 'my.hypothesis'          ),
-    ('reddit'         , 'my.reddit'              ),
    ('pocket'         , 'my.pocket'              ),
    ('twint'          , 'my.twitter.twint'       ),
    ('twitter_archive', 'my.twitter.archive'     ),
@@ -144,14 +143,25 @@ for cls, p in modules:

    Reddit data: saved items/comments/upvotes/etc.

+    # Note: can't be generated as easily since this is a nested configuration object
    #+begin_src python
    class reddit:
-        '''
-        Uses [[https://github.com/karlicoss/rexport][rexport]] output.
-        '''
+        class rexport:
+            '''
+            Uses [[https://github.com/karlicoss/rexport][rexport]] output.
+            '''
+
+            # path[s]/glob to the exported JSON data
+            export_path: Paths
+
+        class pushshift:
+            '''
+            Uses [[https://github.com/seanbreckenridge/pushshift_comment_export][pushshift]] to get access to old comments
+            '''
+
+            # path[s]/glob to the exported JSON data
+            export_path: Paths

-        # path[s]/glob to the exported JSON data
-        export_path: Paths
    #+end_src
 ** [[file:../my/pocket.py][my.pocket]]

--- a/doc/MODULE_DESIGN.org
+++ b/doc/MODULE_DESIGN.org
@@ -76,11 +76,11 @@ A related concern is how to structure namespace packages to allow users to easil

 - In addition, you can *override* the builtin HPI modules too:

-  : custom_reddit_overlay
+  : custom_lastfm_overlay
  : └── my
-  :     └──reddit.py
+  :     └──lastfm.py

-  Now if you add =custom_reddit_overlay= *in front* of ~PYTHONPATH~, all the downstream scripts using =my.reddit= will load it from =custom_reddit_overlay= instead.
+  Now if you add =custom_lastfm_overlay= [[https://docs.python.org/3/using/cmdline.html#envvar-PYTHONPATH][*in front* of ~PYTHONPATH~]], all the downstream scripts using =my.lastfm= will load it from =custom_lastfm_overlay= instead.

  This could be useful to monkey patch some behaviours, or dynamically add some extra data sources -- anything that comes to your mind.
  You can check [[https://github.com/karlicoss/hpi-personal-overlay/blob/7fca8b1b6031bf418078da2d8be70fd81d2d8fa0/src/my/calendar/holidays.py#L1-L14][my.calendar.holidays]] in my personal overlay as a reference.
@@ -99,15 +99,15 @@ In order to do that, like stated above, you could edit the ~PYTHONPATH~ variable

 In the context of HPI, it being a namespace package means you can have a local clone of this repository, and your own 'HPI' modules in a separate folder, which then get combined into the ~my~ package.

-As an example, say you were trying to override the ~my.reddit~ file, to include some new feature. You could create a new file hierarchy like:
+As an example, say you were trying to override the ~my.lastfm~ file, to include some new feature. You could create a new file hierarchy like:

 : .
 : ├── my
-: │   ├── reddit.py
+: │   ├── lastfm.py
 : │   └── some_new_module.py
 : └── setup.py

-Where ~reddit.py~ is your version of ~my.reddit~, which you've copied from this repository and applied your changes to. The ~setup.py~ would be something like:
+Where ~lastfm.py~ is your version of ~my.lastfm~, which you've copied from this repository and applied your changes to. The ~setup.py~ would be something like:

    #+begin_src python
    from setuptools import setup, find_namespace_packages
@@ -121,9 +121,9 @@ Where ~reddit.py~ is your version of ~my.reddit~, which you've copied from this
    )
    #+end_src

-Then, running ~pip3 install -e .~ in that directory would install that as part of the namespace package, and assuming (see below for possible issues) this appears on ~sys.path~ before the upstream repository, your ~reddit.py~ file overrides the upstream. Adding more files, like ~my.some_new_module~ into that directory immediately updates the global ~my~ package -- allowing you to quickly add new modules without having to re-install.
+Then, running ~python3 -m pip install -e .~ in that directory would install that as part of the namespace package, and assuming (see below for possible issues) this appears on ~sys.path~ before the upstream repository, your ~lastfm.py~ file overrides the upstream. Adding more files, like ~my.some_new_module~ into that directory immediately updates the global ~my~ package -- allowing you to quickly add new modules without having to re-install.

-If you install both directories as editable packages (which has the benefit of any changes you making in either repository immediately updating the globally installed ~my~ package), there are some concerns with which editable install appears on your ~sys.path~ first. If you wanted your modules to override the upstream modules, yours would have to appear on the ~sys.path~ first (this is the same reason that =custom_reddit_overlay= must be at the front of your ~PYTHONPATH~). For more details and examples on dealing with editable namespace packages in the context of HPI, see the [[https://github.com/seanbreckenridge/reorder_editable][reorder_editable]] repository.
+If you install both directories as editable packages (which has the benefit of any changes you making in either repository immediately updating the globally installed ~my~ package), there are some concerns with which editable install appears on your ~sys.path~ first. If you wanted your modules to override the upstream modules, yours would have to appear on the ~sys.path~ first (this is the same reason that =custom_lastfm_overlay= must be at the front of your ~PYTHONPATH~). For more details and examples on dealing with editable namespace packages in the context of HPI, see the [[https://github.com/seanbreckenridge/reorder_editable][reorder_editable]] repository.

 There is no limit to how many directories you could install into a single namespace package, which could be a possible way for people to install additional HPI modules, without worrying about the module count here becoming too large to manage.

--- a/doc/SETUP.org
+++ b/doc/SETUP.org
@@ -355,7 +355,7 @@ The only thing you need to do is to tell it where to find the files on your disk
 Reddit has a proper API, so in theory HPI could talk directly to Reddit and retrieve the latest data. But that's not what it doing!

 - first, there are excellent programmatic APIs for Reddit out there already, for example, [[https://github.com/praw-dev/praw][praw]]
- more importantly, this is the [[https://beepb00p.xyz/exports.html#design][design decision]] of HP
+- more importantly, this is the [[https://beepb00p.xyz/exports.html#design][design decision]] of HPI

  It doesn't deal with all with the complexities of API interactions.
  Instead, it relies on other tools to put *intermediate, raw data*, on your disk and then transforms this data into something nice.
@@ -368,19 +368,18 @@ As an example, for [[file:../my/reddit.py][Reddit]], HPI is relying on data fetc
 :              ⇓⇓⇓
 :  |💾 /backups/reddit/*.json |
 :              ⇓⇓⇓
-:      HPI (my.reddit)
+:      HPI (my.reddit.rexport)
 :              ⇓⇓⇓
 :     < python interface >

 So, in your [[file:MODULES.org::#myreddit][reddit config]], similarly to Takeout, you need =export_path=, so HPI knows how to find your Reddit data on the disk.

 But there is an extra caveat: rexport is already coming with nice [[https://github.com/karlicoss/rexport/blob/master/dal.py][data bindings]] to parse its outputs.
-Another *design decision* of HPI is to use existing code and libraries as much as possible, so we also specify a path to =rexport= repository in the config.
-
-(note: in the future it's possible that rexport will be installed via PIP, I just haven't had time for it so far).

 Several other HPI modules are following a similar pattern: hypothesis, instapaper, pinboard, kobo, etc.

+Since the [[https://github.com/karlicoss/rexport#api-limitations][reddit API has limited results]], you can use [[https://github.com/seanbreckenridge/pushshift_comment_export][my.reddit.pushshift]] to access older reddit comments, which both then get merged into =my.reddit.all.comments=
+
 ** Twitter

 Twitter is interesting, because it's an example of an HPI module that *arbitrates* between several data sources from the same service.