my.reddit: refactor into module that supports pushshift/gdpr (#179)

* initial pushshift/rexport merge implementation, using id for merging
* smarter module deprecation warning using regex
* add `RedditBase` from promnesia
* `import_source` helper for gracefully handing mixin data sources
This commit is contained in:
Sean Breckenridge 2021-10-31 13:39:04 -07:00 committed by GitHub
parent b54ec0d7f1
commit 8422c6e420
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
15 changed files with 374 additions and 58 deletions

View file

@ -355,7 +355,7 @@ The only thing you need to do is to tell it where to find the files on your disk
Reddit has a proper API, so in theory HPI could talk directly to Reddit and retrieve the latest data. But that's not what it doing!
- first, there are excellent programmatic APIs for Reddit out there already, for example, [[https://github.com/praw-dev/praw][praw]]
- more importantly, this is the [[https://beepb00p.xyz/exports.html#design][design decision]] of HP
- more importantly, this is the [[https://beepb00p.xyz/exports.html#design][design decision]] of HPI
It doesn't deal with all with the complexities of API interactions.
Instead, it relies on other tools to put *intermediate, raw data*, on your disk and then transforms this data into something nice.
@ -368,19 +368,18 @@ As an example, for [[file:../my/reddit.py][Reddit]], HPI is relying on data fetc
: ⇓⇓⇓
: |💾 /backups/reddit/*.json |
: ⇓⇓⇓
: HPI (my.reddit)
: HPI (my.reddit.rexport)
: ⇓⇓⇓
: < python interface >
So, in your [[file:MODULES.org::#myreddit][reddit config]], similarly to Takeout, you need =export_path=, so HPI knows how to find your Reddit data on the disk.
But there is an extra caveat: rexport is already coming with nice [[https://github.com/karlicoss/rexport/blob/master/dal.py][data bindings]] to parse its outputs.
Another *design decision* of HPI is to use existing code and libraries as much as possible, so we also specify a path to =rexport= repository in the config.
(note: in the future it's possible that rexport will be installed via PIP, I just haven't had time for it so far).
Several other HPI modules are following a similar pattern: hypothesis, instapaper, pinboard, kobo, etc.
Since the [[https://github.com/karlicoss/rexport#api-limitations][reddit API has limited results]], you can use [[https://github.com/seanbreckenridge/pushshift_comment_export][my.reddit.pushshift]] to access older reddit comments, which both then get merged into =my.reddit.all.comments=
** Twitter
Twitter is interesting, because it's an example of an HPI module that *arbitrates* between several data sources from the same service.