HPI/doc/SETUP.org

# TODO  FAQ??
Please don't be shy and raise issues if something in the instructions is unclear.
You'd be really helping me, I want to make the setup as straightforward as possible!

# update with org-make-toc
* TOC
:PROPERTIES:
:TOC:      :include all
:END:

:CONTENTS:
- [[#toc][TOC]]
- [[#few-notes][Few notes]]
- [[#install-main-hpi-package][Install main HPI package]]
  - [[#option-1-install-from-pip][option 1: install from PIP]]
  - [[#option-2-localeditable-install][option 2: local/editable install]]
  - [[#option-3-use-without-installing][option 3: use without installing]]
  - [[#appendix-optional-packages][appendix: optional packages]]
- [[#setting-up-modules][Setting up modules]]
  - [[#private-configuration-myconfig][private configuration (my.config)]]
  - [[#module-dependencies][module dependencies]]
- [[#troubleshooting][Troubleshooting]]
  - [[#common-issues][common issues]]
- [[#usage-examples][Usage examples]]
  - [[#end-to-end-roam-research-setup][End-to-end Roam Research setup]]
  - [[#polar][Polar]]
  - [[#google-takeout][Google Takeout]]
  - [[#kobo-reader][Kobo reader]]
  - [[#orger][Orger]]
    - [[#orger--polar][Orger + Polar]]
  - [[#demopy][demo.py]]
- [[#data-flow][Data flow]]
  - [[#polar-bookshelf][Polar Bookshelf]]
  - [[#google-takeout][Google Takeout]]
  - [[#reddit][Reddit]]
  - [[#twitter][Twitter]]
  - [[#connecting-to-other-apps][Connecting to other apps]]
- [[#addingmodifying-modules][Adding/modifying modules]]
:END:


* Few notes
I understand that people who'd like to use this may not be super familiar with Python, pip or generally unix, so here are some useful notes:

- only ~python >= 3.7~ is supported
- I'm using ~pip3~ command, but on your system you might only have ~pip~.

  If your ~pip --version~ says python 3, feel free to use ~pip~.

- If you have issues getting ~pip~ or ~pip3~ to work, it may be worth invoking the module instead using a fully qualified path, like ~python3 -m pip~ (e.g. ~python3 -m pip install --user ..~)

- similarly, I'm using =python3= in the documentation, but if your =python --version= says python3, it's okay to use =python=

- when you are using ~pip install~, [[https://stackoverflow.com/a/42989020/706389][always pass]] =--user=, and *never install third party packages with sudo* (unless you know what you are doing)
- throughout the guide I'm assuming the user config directory is =~/.config=, but it's *different on Mac/Windows*.

  See [[https://github.com/ActiveState/appdirs/blob/3fe6a83776843a46f20c2e5587afcffe05e03b39/appdirs.py#L187-L190][this]] if you're not sure what's your user config dir.

* Install main HPI package
This is a *required step*

You can choose one of the following options:

** option 1: install from [[https://pypi.org/project/HPI][PIP]]
This is the *easiest way*:

: pip3 install --user HPI

** option 2: local/editable install
This is convenient if you're planning to add new modules or change the existing ones.

1. Clone the repository: =git clone git@github.com:karlicoss/HPI.git /path/to/hpi=
2. Go into the project directory: =cd /path/to/hpi=
2. Run  ~pip3 install --user -e .~

   This will install the package in 'editable mode'.
   It means that any changes to =/path/to/hpi= will be immediately reflected without need to reinstall anything.

   It's *extremely* convenient for developing and debugging.

** option 3: use without installing
This is less convenient, but gives you more control.

1. Clone the repository: =git clone git@github.com:karlicoss/HPI.git /path/to/hpi=
2. Go into the project directory: =cd /path/to/hpi=
3. Install the dependencies: ~python3 setup.py --dependencies-only~
4. Use =with_my= script to get access to ~my.~ modules.

   For example:

   : /path/to/hpi/with_my python3 -c 'from my.pinboard import bookmarks; print(list(bookmarks()))'

   It's also convenient to put a symlink to =with_my= somewhere in your system path so you can run it from anywhere, or add an alias in your bashrc:

   : alias with_my='/path/to/hpi/with_my'

   After that, you can wrap your command in =with_my= to give it access to ~my.~ modules, e.g. see [[#usage-examples][examples]].

The benefit of this way is that you get a bit more control, explicitly allowing your scripts to use your data.

** appendix: optional packages
You can also install some optional packages

: pip3 install 'HPI[optional]'

They aren't necessary, but will improve your experience. At the moment these are:

- [[https://github.com/karlicoss/cachew][cachew]]: automatic caching library, which can greatly speedup data access
- [[https://github.com/metachris/logzero][logzero]]: a nice logging library, supporting colors
- [[https://github.com/ijl/orjson][orjson]]: a library for serializing data to JSON, used in ~my.core.serialize~ and the ~hpi query~ interface
- [[https://github.com/python/mypy][mypy]]: mypy is used for checking configs and troubleshooting

* Setting up modules
This is an *optional step* as few modules work without extra setup.
But it depends on the specific module.

See [[file:MODULES.org][MODULES]] to read documentation on specific modules that interest you.

You might also find interesting to read [[file:CONFIGURING.org][CONFIGURING]], where I'm
elaborating on some technical rationales behind the current configuration system.

** private configuration (=my.config=)
# TODO write about dynamic configuration
# todo add a command to edit config?? e.g. HPI config edit
If you're not planning to use private configuration (some modules don't need it) you can skip straight to the next step. Still, I'd recommend you to read anyway.

The configuration usually contains paths to the data on your disk, and some modules have extra settings.
The config is simply a *python package* (named =my.config=), expected to be in =~/.config/my=.
If you'd like to change the location of the =my.config= directory, you can set the =MY_CONFIG= environment variable. e.g. in your .bashrc add: ~export MY_CONFIG=$HOME/.my/~

Since it's a Python package, generally it's very *flexible* and there are many ways to set it up.

- *The simplest way*

  After installing HPI, run =hpi config create=.

  This will create an empty config file for you (usually, in =~/.config/my=), which you can edit. Example configuration:

  #+begin_src python
  import pytz # yes, you can use any Python stuff in the config

  class emfit:
      export_path = '/data/exports/emfit'
      tz = pytz.timezone('Europe/London')
      excluded_sids = []
      cache_path  = '/tmp/emfit.cache'

  class instapaper:
      export_path = '/data/exports/instapaper'

  class roamresearch:
      export_path = '/data/exports/roamresearch'
      username    = 'karlicoss'
  #+end_src

  To find out which attributes you need to specify:

  - check in [[file:MODULES.org][MODULES]]
  - check in [[file:../my/config.py][the default config stubs]]
  - if there is nothing there, the easiest is perhaps to skim through the module's code and search for =config.= uses.

    For example, if you search for =config.= in [[file:../my/emfit/__init__.py][emfit module]], you'll see that it's using =export_path=, =tz=, =excluded_sids= and =cache_path=.

  - or you can just try running them and fill in the attributes Python complains about!

    or run =hpi doctor my.modulename=

# TODO link to post about exports?
** module dependencies
Dependencies are different for specific modules you're planning to use, so it's hard to tell in advance what you'll need.

First thing you should try is just using the module; if it works -- great! If it doesn't (i.e. you get something like =ImportError=):

- try using =hpi module install modulename= (where =<modulename>= is something like =my.hypothesis=, etc.)

  This command uses [[https://github.com/karlicoss/HPI/search?l=Python&q=REQUIRES][REQUIRES]] declaration to install the dependencies.

- otherwise manually install missing packages via ~pip3 install --user~

  Also please feel free to report if the command above didn't install some dependencies!


* Troubleshooting
# todo replace with_my with it??

HPI comes with a command line tool that can help you detect potential issues. Run:

: hpi doctor
: # alternatively, for more output:
: hpi doctor --verbose

If you only have a few modules set up, lots of them will error for you, which is expected, so check the ones you expect to work.

If you're having issues with ~cachew~ or want to show logs to troubleshoot what may be happening, you can pass the debug flag (e.g., ~hpi --debug doctor my.module_name~) or set the ~HPI_LOGS~ environment variable (e.g., ~HPI_LOGS=debug hpi query my.module_name~) to print all logs, including the ~cachew~ dependencies. ~HPI_LOGS~ could also be used to silence ~info~ logs, like ~HPI_LOGS=warning hpi ...~

If you want ~HPI~ to autocomplete the module names for you, this comes with shell completion, see [[../misc/completion/][misc/completion]]

If you have any ideas on how to improve it, please let me know!

Here's a screenshot how it looks when everything is mostly good: [[https://user-images.githubusercontent.com/291333/82806066-f7dfe400-9e7c-11ea-8763-b3bee8ada308.png][link]].

If you experience issues, feel free to report, but please attach your:

- OS version
- python version: =python3 --version=
- HPI version: =pip3 show HPI=
- if you see some exception, attach a full log (just make sure there is not private information in it)
- if you think it can help, attach screenshots

** common issues
- run =hpi config check=, it help to spot certain errors
  Also really recommended to install =mypy= first, it really helps to spot various trivial errors
- if =hpi= shows you something like 'command not found', try using =python3 -m my.core= instead
  This likely means that your =$HOME/.local/bin= directory isn't in your =$PATH=

* Usage examples

** End-to-end Roam Research setup
In [[https://beepb00p.xyz/myinfra-roam.html#export][this]] post you can trace all steps:

- learn how to export your raw data
- integrate it with HPI package
- benefit from HPI integration

  - use interactively in ipython
  - use with [[https://github.com/karlicoss/orger][Orger]]
  - use with [[https://github.com/karlicoss/promnesia][Promnesia]]

If you want to set up a new data source, it could be a good learning reference.

** Polar
Polar doesn't require any setup as it accesses the highlights on your filesystem (usually in =~/.polar=).

You can try if it works with:

: python3 -c 'import my.polar as polar; print(polar.get_entries())'

** Google Takeout
If you have zip Google Takeout archives, you can use HPI to access it:

- prepare the config =~/.config/my/my/config.py=

  #+begin_src python
  class google:
      # you can pass the directory, a glob, or a single zip file
      takeout_path = '/backups/takeouts/*.zip'
  #+end_src

- use it:

  #+begin_src
  $ python3 -c 'import my.media.youtube as yt; print(yt.get_watched()[-1])'
  Watched(url='https://www.youtube.com/watch?v=p0t0J_ERzHM', title='Monster magnet meets monster magnet...', when=datetime.datetime(2020, 1, 22, 20, 34, tzinfo=<UTC>))
  #+end_src


** Kobo reader
Kobo module allows you to access the books you've read along with the highlights and notes.
It uses exports provided by [[https://github.com/karlicoss/kobuddy][kobuddy]] package.

- prepare the config

  1. Install =kobuddy= from PIP
  2. Add kobo config to =~/.config/my/my/config.py=
    #+begin_src python
    class kobo:
        export_dir = '/backups/to/kobo/'
    #+end_src

After that you should be able to use it:

#+begin_src bash
  python3 -c 'import my.books.kobo as kobo; print(kobo.get_highlights())'
#+end_src

** Orger
# TODO include this from orger docs??

You can use [[https://github.com/karlicoss/orger][orger]] to get Org-mode representations of your data.

Some examples (assuming you've [[https://github.com/karlicoss/orger#installing][installed]] Orger):

*** Orger + [[https://github.com/burtonator/polar-bookshelf][Polar]]

This will mirror Polar highlights as org-mode:

: orger/modules/polar.py --to polar.org

** =demo.py=
read/run [[../demo.py][demo.py]] for a full demonstration of setting up Hypothesis (uses annotations data from a public Github repository)

* Data flow
# todo eh, could publish this as a blog page? dunno

Here, I'll demonstrate how data flows into and from HPI on several examples, starting from the simplest to more complicated.

If you want to see how it looks as a whole, check out [[https://beepb00p.xyz/myinfra.html#mypkg][my infrastructure map]]!

** Polar Bookshelf
Polar keeps the data:

- *locally*, on your disk
- in =~/.polar=,
- as a bunch of *JSON files*

It's excellent from all perspectives, except one -- you can only use meaningfully use it through Polar app.
However, you might want to integrate your data elsewhere and use it in ways that Polar developers never even anticipated!

If you check the data layout ([[https://github.com/TheCedarPrince/KnowledgeRepository][example]]), you can see it's messy: scattered across multiple directories, contains raw HTML, obscure entities, etc.
It's understandable from the app developer's perspective, but it makes things frustrating when you want to work with this data.

# todo hmm what if I could share deserialization with Polar app?

Here comes the HPI [[file:../my/polar.py][polar module]]!

: |💾 ~/.polar (raw JSON data) |
:             ⇓⇓⇓
:    HPI (my.polar)
:             ⇓⇓⇓
:    < python interface >

So the data is read from the =|💾 filesystem |=, processed/normalized with HPI, which results in a nice programmatic =< interface >= for Polar data.

Note that it doesn't require any extra configuration -- it "just" works because the data is kept locally in the *known location*.

** Google Takeout
# TODO twitter archive might be better here?
Google Takeout exports are, unfortunately, manual (or semi-manual if you do some [[https://beepb00p.xyz/my-data.html#takeout][voodoo]] with mounting Google Drive).
Anyway, say you're doing it once in six months, so you end up with a several archives on your disk:

: /backups/takeout/takeout-20151201.zip
: ....
: /backups/takeout/takeout-20190901.zip
: /backups/takeout/takeout-20200301.zip

Inside the archives.... there is a [[https://www.specytech.com/blog/wp-content/uploads/2019/06/google-takeout-folder.png][bunch]] of random files from all your google services.
Lately, many of them are JSONs, but for example, in 2015 most of it was in HTMLs! It's a nightmare to work with, even when you're an experienced programmer.

# Even within a single data source (e.g. =My Activity/Search=) you have a mix of HTML and JSON files.
# todo eh, I need to actually add JSON processing first
Of course, HPI helps you here by encapsulating all this parsing logic and exposing Python interfaces instead.

:       < 🌐  Google |
:              ⇓⇓⇓
:     { manual download }
:              ⇓⇓⇓
:  |💾 /backups/takeout/*.zip |
:              ⇓⇓⇓
:    HPI (my.google.takeout)
:              ⇓⇓⇓
:     < python interface >

The only thing you need to do is to tell it where to find the files on your disk, via [[file:MODULES.org::#mygoogletakeoutpaths][the config]], because different people use different paths for backups.

# TODO how to emphasize config?

** Reddit

Reddit has a proper API, so in theory HPI could talk directly to Reddit and retrieve the latest data. But that's not what it doing!

- first, there are excellent programmatic APIs for Reddit out there already, for example, [[https://github.com/praw-dev/praw][praw]]
- more importantly, this is the [[https://beepb00p.xyz/exports.html#design][design decision]] of HPI

  It doesn't deal with all with the complexities of API interactions.
  Instead, it relies on other tools to put *intermediate, raw data*, on your disk and then transforms this data into something nice.

As an example, for [[file:../my/reddit.py][Reddit]], HPI is relying on data fetched by [[https://github.com/karlicoss/rexport][rexport]] library. So the pipeline looks like:

:       < 🌐  Reddit |
:              ⇓⇓⇓
:     { rexport/export.py (automatic, e.g. cron) }
:              ⇓⇓⇓
:  |💾 /backups/reddit/*.json |
:              ⇓⇓⇓
:      HPI (my.reddit.rexport)
:              ⇓⇓⇓
:     < python interface >

So, in your [[file:MODULES.org::#myreddit][reddit config]], similarly to Takeout, you need =export_path=, so HPI knows how to find your Reddit data on the disk.

But there is an extra caveat: rexport is already coming with nice [[https://github.com/karlicoss/rexport/blob/master/dal.py][data bindings]] to parse its outputs.

Several other HPI modules are following a similar pattern: hypothesis, instapaper, pinboard, kobo, etc.

Since the [[https://github.com/karlicoss/rexport#api-limitations][reddit API has limited results]], you can use [[https://github.com/seanbreckenridge/pushshift_comment_export][my.reddit.pushshift]] to access older reddit comments, which both then get merged into =my.reddit.all.comments=

** Twitter

Twitter is interesting, because it's an example of an HPI module that *arbitrates* between several data sources from the same service.

The reason to use multiple in case of Twitter is:

- there is official Twitter Archive, but it's manual, takes several days to complete and hard to automate.
- there is [[https://github.com/twintproject/twint][twint]], which can get real-time Twitter data via scraping

  But Twitter has a limitation and you can't get data past 3200 tweets through API or scraping.

So the idea is to export both data sources on your disk:

:                              < 🌐  Twitter |
:                              ⇓⇓            ⇓⇓
:     { manual archive download }           { twint (automatic, cron) }
:              ⇓⇓⇓                                   ⇓⇓⇓
:  |💾 /backups/twitter-archives/*.zip |     |💾 /backups/twint/db.sqlite |
:                                 .............

# TODO note that the left and right parts of the diagram ('before filesystem' and 'after filesystem') are completely independent!
# if something breaks, you can still read your old data from the filesystem!

What we do next is:

1. Process raw data from twitter archives (manual export, but has all the data)
2. Process raw data from twint database (automatic export, but only recent data)
3. Merge them together, overlaying twint data on top of twitter archive data

:                                 .............
:  |💾 /backups/twitter-archives/*.zip |     |💾 /backups/twint/db.sqlite |
:              ⇓⇓⇓                                   ⇓⇓⇓
:      HPI (my.twitter.archive)              HPI (my.twitter.twint)
:       ⇓                     ⇓              ⇓                    ⇓
:       ⇓                   HPI (my.twitter.all)                  ⇓
:       ⇓                           ⇓⇓                            ⇓
: < python interface>       < python interface>          < python interface>

For merging the data, we're using a tiny auxiliary module, =my.twitter.all= (It's just 20 lines of code, [[file:../my/twitter/all.py][check it out]]).

Since you have two different sources of raw data, you need to specify two bits of config:
# todo link to modules thing?

: class twint:
:     export_path = '/backups/twint/db.sqlite'

: class twitter_archive:
:     export_path = '/backups/twitter-archives/*.zip'

Note that you can also just use =my.twitter.archive= or =my.twitter.twint= directly, or set either of paths to empty string: =''=

# #addingmodifying-modules
# Now, say you prefer to use a different library for your Twitter data instead of twint (for whatever reason), and you want to use it
# TODO docs on overlays?

** Connecting to other apps
As a user you might not be so interested in Python interface per se.. but a nice thing about having one is that it's easy to
connect the data with other apps and libraries!

:                          /---- 💻promnesia --- | browser extension  >
: | python interface > ----+---- 💻orger     --- |💾 org-mode mirror  |
:                          +-----💻memacs    --- |💾 org-mode lifelog |
:                          +-----💻????      --- | REST api           >
:                          +-----💻????      --- | Datasette          >
:                          \-----💻????      --- | Memex              >

See more in [[file:../README.org::#how-do-you-use-it]["How do you use it?"]] section.

Also check out [[https://beepb00p.xyz/myinfra.html#hpi][my personal infrastructure map]] to see where I'm using HPI.

* Adding/modifying modules
# TODO link to 'overlays' documentation?
# TODO don't be afraid to TODO make sure to install in editable mode

- The easiest is just to clone HPI repository and run an editable PIP install (=pip3 install --user -e .=), or via [[#use-without-installing][with_my]] wrapper.

  After that you can just edit the code directly, your changes will be reflected immediately, and you will be able to quickly iterate/fix bugs/add new methods.

  This is great if you just want to add a few of your own personal modules, or make minimal changes to a few files. If you do much more than that, you may run into possible merge conflicts if/when you update (~git pull~) HPI

# TODO eh. doesn't even have to be in 'my' namespace?? need to check it
- The "proper way" (unless you want to contribute to the upstream) is to create a separate file hierarchy and add your module to =PYTHONPATH=.

  # hmmm seems to be no obvious way to link to a header in a separate file,
  # if you want this in both emacs and how github renders org mode
  # https://github.com/karlicoss/HPI/pull/160#issuecomment-817318076
  See [[file:MODULE_DESIGN.org#addingmodules][MODULE_DESIGN/adding modules]] for more information