473 lines
21 KiB
Org Mode
473 lines
21 KiB
Org Mode
# TODO FAQ??
|
|
Please don't be shy and raise issues if something in the instructions is unclear.
|
|
You'd be really helping me, I want to make the setup as straightforward as possible!
|
|
|
|
# update with org-make-toc
|
|
* TOC
|
|
:PROPERTIES:
|
|
:TOC: :include all
|
|
:END:
|
|
|
|
:CONTENTS:
|
|
- [[#toc][TOC]]
|
|
- [[#few-notes][Few notes]]
|
|
- [[#install-main-hpi-package][Install main HPI package]]
|
|
- [[#option-1-install-from-pip][option 1: install from PIP]]
|
|
- [[#option-2-localeditable-install][option 2: local/editable install]]
|
|
- [[#option-3-use-without-installing][option 3: use without installing]]
|
|
- [[#appendix-optional-packages][appendix: optional packages]]
|
|
- [[#setting-up-modules][Setting up modules]]
|
|
- [[#private-configuration-myconfig][private configuration (my.config)]]
|
|
- [[#module-dependencies][module dependencies]]
|
|
- [[#troubleshooting][Troubleshooting]]
|
|
- [[#common-issues][common issues]]
|
|
- [[#usage-examples][Usage examples]]
|
|
- [[#end-to-end-roam-research-setup][End-to-end Roam Research setup]]
|
|
- [[#polar][Polar]]
|
|
- [[#google-takeout][Google Takeout]]
|
|
- [[#kobo-reader][Kobo reader]]
|
|
- [[#orger][Orger]]
|
|
- [[#orger--polar][Orger + Polar]]
|
|
- [[#demopy][demo.py]]
|
|
- [[#data-flow][Data flow]]
|
|
- [[#polar-bookshelf][Polar Bookshelf]]
|
|
- [[#google-takeout][Google Takeout]]
|
|
- [[#reddit][Reddit]]
|
|
- [[#twitter][Twitter]]
|
|
- [[#connecting-to-other-apps][Connecting to other apps]]
|
|
- [[#addingmodifying-modules][Adding/modifying modules]]
|
|
:END:
|
|
|
|
|
|
* Few notes
|
|
I understand that people who'd like to use this may not be super familiar with Python, pip or generally unix, so here are some useful notes:
|
|
|
|
- only ~python >= 3.7~ is supported
|
|
- I'm using ~pip3~ command, but on your system you might only have ~pip~.
|
|
|
|
If your ~pip --version~ says python 3, feel free to use ~pip~.
|
|
|
|
- If you have issues getting ~pip~ or ~pip3~ to work, it may be worth invoking the module instead using a fully qualified path, like ~python3 -m pip~ (e.g. ~python3 -m pip install --user ..~)
|
|
|
|
- similarly, I'm using =python3= in the documentation, but if your =python --version= says python3, it's okay to use =python=
|
|
|
|
- when you are using ~pip install~, [[https://stackoverflow.com/a/42989020/706389][always pass]] =--user=, and *never install third party packages with sudo* (unless you know what you are doing)
|
|
- throughout the guide I'm assuming the user config directory is =~/.config=, but it's *different on Mac/Windows*.
|
|
|
|
See [[https://github.com/ActiveState/appdirs/blob/3fe6a83776843a46f20c2e5587afcffe05e03b39/appdirs.py#L187-L190][this]] if you're not sure what's your user config dir.
|
|
|
|
* Install main HPI package
|
|
This is a *required step*
|
|
|
|
You can choose one of the following options:
|
|
|
|
** option 1: install from [[https://pypi.org/project/HPI][PIP]]
|
|
This is the *easiest way*:
|
|
|
|
: pip3 install --user HPI
|
|
|
|
** option 2: local/editable install
|
|
This is convenient if you're planning to add new modules or change the existing ones.
|
|
|
|
1. Clone the repository: =git clone git@github.com:karlicoss/HPI.git /path/to/hpi=
|
|
2. Go into the project directory: =cd /path/to/hpi=
|
|
2. Run ~pip3 install --user -e .~
|
|
|
|
This will install the package in 'editable mode'.
|
|
It means that any changes to =/path/to/hpi= will be immediately reflected without need to reinstall anything.
|
|
|
|
It's *extremely* convenient for developing and debugging.
|
|
|
|
** option 3: use without installing
|
|
This is less convenient, but gives you more control.
|
|
|
|
1. Clone the repository: =git clone git@github.com:karlicoss/HPI.git /path/to/hpi=
|
|
2. Go into the project directory: =cd /path/to/hpi=
|
|
3. Install the dependencies: ~python3 setup.py --dependencies-only~
|
|
4. Use =with_my= script to get access to ~my.~ modules.
|
|
|
|
For example:
|
|
|
|
: /path/to/hpi/with_my python3 -c 'from my.pinboard import bookmarks; print(list(bookmarks()))'
|
|
|
|
It's also convenient to put a symlink to =with_my= somewhere in your system path so you can run it from anywhere, or add an alias in your bashrc:
|
|
|
|
: alias with_my='/path/to/hpi/with_my'
|
|
|
|
After that, you can wrap your command in =with_my= to give it access to ~my.~ modules, e.g. see [[#usage-examples][examples]].
|
|
|
|
The benefit of this way is that you get a bit more control, explicitly allowing your scripts to use your data.
|
|
|
|
** appendix: optional packages
|
|
You can also install some optional packages
|
|
|
|
: pip3 install 'HPI[optional]'
|
|
|
|
They aren't necessary, but will improve your experience. At the moment these are:
|
|
|
|
- [[https://github.com/karlicoss/cachew][cachew]]: automatic caching library, which can greatly speedup data access
|
|
- [[https://github.com/metachris/logzero][logzero]]: a nice logging library, supporting colors
|
|
- [[https://github.com/ijl/orjson][orjson]]: a library for serializing data to JSON, used in ~my.core.serialize~ and the ~hpi query~ interface
|
|
- [[https://github.com/python/mypy][mypy]]: mypy is used for checking configs and troubleshooting
|
|
|
|
* Setting up modules
|
|
This is an *optional step* as few modules work without extra setup.
|
|
But it depends on the specific module.
|
|
|
|
See [[file:MODULES.org][MODULES]] to read documentation on specific modules that interest you.
|
|
|
|
You might also find interesting to read [[file:CONFIGURING.org][CONFIGURING]], where I'm
|
|
elaborating on some technical rationales behind the current configuration system.
|
|
|
|
** private configuration (=my.config=)
|
|
# TODO write about dynamic configuration
|
|
# todo add a command to edit config?? e.g. HPI config edit
|
|
If you're not planning to use private configuration (some modules don't need it) you can skip straight to the next step. Still, I'd recommend you to read anyway.
|
|
|
|
The configuration usually contains paths to the data on your disk, and some modules have extra settings.
|
|
The config is simply a *python package* (named =my.config=), expected to be in =~/.config/my=.
|
|
If you'd like to change the location of the =my.config= directory, you can set the =MY_CONFIG= environment variable. e.g. in your .bashrc add: ~export MY_CONFIG=$HOME/.my/~
|
|
|
|
Since it's a Python package, generally it's very *flexible* and there are many ways to set it up.
|
|
|
|
- *The simplest way*
|
|
|
|
After installing HPI, run =hpi config create=.
|
|
|
|
This will create an empty config file for you (usually, in =~/.config/my=), which you can edit. Example configuration:
|
|
|
|
#+begin_src python
|
|
import pytz # yes, you can use any Python stuff in the config
|
|
|
|
class emfit:
|
|
export_path = '/data/exports/emfit'
|
|
tz = pytz.timezone('Europe/London')
|
|
excluded_sids = []
|
|
cache_path = '/tmp/emfit.cache'
|
|
|
|
class instapaper:
|
|
export_path = '/data/exports/instapaper'
|
|
|
|
class roamresearch:
|
|
export_path = '/data/exports/roamresearch'
|
|
username = 'karlicoss'
|
|
#+end_src
|
|
|
|
To find out which attributes you need to specify:
|
|
|
|
- check in [[file:MODULES.org][MODULES]]
|
|
- check in [[file:../my/config.py][the default config stubs]]
|
|
- if there is nothing there, the easiest is perhaps to skim through the module's code and search for =config.= uses.
|
|
|
|
For example, if you search for =config.= in [[file:../my/emfit/__init__.py][emfit module]], you'll see that it's using =export_path=, =tz=, =excluded_sids= and =cache_path=.
|
|
|
|
- or you can just try running them and fill in the attributes Python complains about!
|
|
|
|
or run =hpi doctor my.modulename=
|
|
|
|
# TODO link to post about exports?
|
|
** module dependencies
|
|
Dependencies are different for specific modules you're planning to use, so it's hard to tell in advance what you'll need.
|
|
|
|
First thing you should try is just using the module; if it works -- great! If it doesn't (i.e. you get something like =ImportError=):
|
|
|
|
- try using =hpi module install modulename= (where =<modulename>= is something like =my.hypothesis=, etc.)
|
|
|
|
This command uses [[https://github.com/karlicoss/HPI/search?l=Python&q=REQUIRES][REQUIRES]] declaration to install the dependencies.
|
|
|
|
- otherwise manually install missing packages via ~pip3 install --user~
|
|
|
|
Also please feel free to report if the command above didn't install some dependencies!
|
|
|
|
|
|
* Troubleshooting
|
|
# todo replace with_my with it??
|
|
|
|
HPI comes with a command line tool that can help you detect potential issues. Run:
|
|
|
|
: hpi doctor
|
|
: # alternatively, for more output:
|
|
: hpi doctor --verbose
|
|
|
|
If you only have a few modules set up, lots of them will error for you, which is expected, so check the ones you expect to work.
|
|
|
|
If you're having issues with ~cachew~ or want to show logs to troubleshoot what may be happening, you can pass the debug flag (e.g., ~hpi --debug doctor my.module_name~) or set the ~HPI_LOGS~ environment variable (e.g., ~HPI_LOGS=debug hpi query my.module_name~) to print all logs, including the ~cachew~ dependencies. ~HPI_LOGS~ could also be used to silence ~info~ logs, like ~HPI_LOGS=warning hpi ...~
|
|
|
|
If you want ~HPI~ to autocomplete the module names for you, this comes with shell completion, see [[../misc/completion/][misc/completion]]
|
|
|
|
If you have any ideas on how to improve it, please let me know!
|
|
|
|
Here's a screenshot how it looks when everything is mostly good: [[https://user-images.githubusercontent.com/291333/82806066-f7dfe400-9e7c-11ea-8763-b3bee8ada308.png][link]].
|
|
|
|
If you experience issues, feel free to report, but please attach your:
|
|
|
|
- OS version
|
|
- python version: =python3 --version=
|
|
- HPI version: =pip3 show HPI=
|
|
- if you see some exception, attach a full log (just make sure there is not private information in it)
|
|
- if you think it can help, attach screenshots
|
|
|
|
** common issues
|
|
- run =hpi config check=, it help to spot certain errors
|
|
Also really recommended to install =mypy= first, it really helps to spot various trivial errors
|
|
- if =hpi= shows you something like 'command not found', try using =python3 -m my.core= instead
|
|
This likely means that your =$HOME/.local/bin= directory isn't in your =$PATH=
|
|
|
|
* Usage examples
|
|
|
|
** End-to-end Roam Research setup
|
|
In [[https://beepb00p.xyz/myinfra-roam.html#export][this]] post you can trace all steps:
|
|
|
|
- learn how to export your raw data
|
|
- integrate it with HPI package
|
|
- benefit from HPI integration
|
|
|
|
- use interactively in ipython
|
|
- use with [[https://github.com/karlicoss/orger][Orger]]
|
|
- use with [[https://github.com/karlicoss/promnesia][Promnesia]]
|
|
|
|
If you want to set up a new data source, it could be a good learning reference.
|
|
|
|
** Polar
|
|
Polar doesn't require any setup as it accesses the highlights on your filesystem (usually in =~/.polar=).
|
|
|
|
You can try if it works with:
|
|
|
|
: python3 -c 'import my.polar as polar; print(polar.get_entries())'
|
|
|
|
** Google Takeout
|
|
If you have zip Google Takeout archives, you can use HPI to access it:
|
|
|
|
- prepare the config =~/.config/my/my/config.py=
|
|
|
|
#+begin_src python
|
|
class google:
|
|
# you can pass the directory, a glob, or a single zip file
|
|
takeout_path = '/backups/takeouts/*.zip'
|
|
#+end_src
|
|
|
|
- use it:
|
|
|
|
#+begin_src
|
|
$ python3 -c 'import my.media.youtube as yt; print(yt.get_watched()[-1])'
|
|
Watched(url='https://www.youtube.com/watch?v=p0t0J_ERzHM', title='Monster magnet meets monster magnet...', when=datetime.datetime(2020, 1, 22, 20, 34, tzinfo=<UTC>))
|
|
#+end_src
|
|
|
|
|
|
** Kobo reader
|
|
Kobo module allows you to access the books you've read along with the highlights and notes.
|
|
It uses exports provided by [[https://github.com/karlicoss/kobuddy][kobuddy]] package.
|
|
|
|
- prepare the config
|
|
|
|
1. Install =kobuddy= from PIP
|
|
2. Add kobo config to =~/.config/my/my/config.py=
|
|
#+begin_src python
|
|
class kobo:
|
|
export_dir = '/backups/to/kobo/'
|
|
#+end_src
|
|
|
|
After that you should be able to use it:
|
|
|
|
#+begin_src bash
|
|
python3 -c 'import my.books.kobo as kobo; print(kobo.get_highlights())'
|
|
#+end_src
|
|
|
|
** Orger
|
|
# TODO include this from orger docs??
|
|
|
|
You can use [[https://github.com/karlicoss/orger][orger]] to get Org-mode representations of your data.
|
|
|
|
Some examples (assuming you've [[https://github.com/karlicoss/orger#installing][installed]] Orger):
|
|
|
|
*** Orger + [[https://github.com/burtonator/polar-bookshelf][Polar]]
|
|
|
|
This will mirror Polar highlights as org-mode:
|
|
|
|
: orger/modules/polar.py --to polar.org
|
|
|
|
** =demo.py=
|
|
read/run [[../demo.py][demo.py]] for a full demonstration of setting up Hypothesis (uses annotations data from a public Github repository)
|
|
|
|
* Data flow
|
|
# todo eh, could publish this as a blog page? dunno
|
|
|
|
Here, I'll demonstrate how data flows into and from HPI on several examples, starting from the simplest to more complicated.
|
|
|
|
If you want to see how it looks as a whole, check out [[https://beepb00p.xyz/myinfra.html#mypkg][my infrastructure map]]!
|
|
|
|
** Polar Bookshelf
|
|
Polar keeps the data:
|
|
|
|
- *locally*, on your disk
|
|
- in =~/.polar=,
|
|
- as a bunch of *JSON files*
|
|
|
|
It's excellent from all perspectives, except one -- you can only use meaningfully use it through Polar app.
|
|
However, you might want to integrate your data elsewhere and use it in ways that Polar developers never even anticipated!
|
|
|
|
If you check the data layout ([[https://github.com/TheCedarPrince/KnowledgeRepository][example]]), you can see it's messy: scattered across multiple directories, contains raw HTML, obscure entities, etc.
|
|
It's understandable from the app developer's perspective, but it makes things frustrating when you want to work with this data.
|
|
|
|
# todo hmm what if I could share deserialization with Polar app?
|
|
|
|
Here comes the HPI [[file:../my/polar.py][polar module]]!
|
|
|
|
: |💾 ~/.polar (raw JSON data) |
|
|
: ⇓⇓⇓
|
|
: HPI (my.polar)
|
|
: ⇓⇓⇓
|
|
: < python interface >
|
|
|
|
So the data is read from the =|💾 filesystem |=, processed/normalized with HPI, which results in a nice programmatic =< interface >= for Polar data.
|
|
|
|
Note that it doesn't require any extra configuration -- it "just" works because the data is kept locally in the *known location*.
|
|
|
|
** Google Takeout
|
|
# TODO twitter archive might be better here?
|
|
Google Takeout exports are, unfortunately, manual (or semi-manual if you do some [[https://beepb00p.xyz/my-data.html#takeout][voodoo]] with mounting Google Drive).
|
|
Anyway, say you're doing it once in six months, so you end up with a several archives on your disk:
|
|
|
|
: /backups/takeout/takeout-20151201.zip
|
|
: ....
|
|
: /backups/takeout/takeout-20190901.zip
|
|
: /backups/takeout/takeout-20200301.zip
|
|
|
|
Inside the archives.... there is a [[https://www.specytech.com/blog/wp-content/uploads/2019/06/google-takeout-folder.png][bunch]] of random files from all your google services.
|
|
Lately, many of them are JSONs, but for example, in 2015 most of it was in HTMLs! It's a nightmare to work with, even when you're an experienced programmer.
|
|
|
|
# Even within a single data source (e.g. =My Activity/Search=) you have a mix of HTML and JSON files.
|
|
# todo eh, I need to actually add JSON processing first
|
|
Of course, HPI helps you here by encapsulating all this parsing logic and exposing Python interfaces instead.
|
|
|
|
: < 🌐 Google |
|
|
: ⇓⇓⇓
|
|
: { manual download }
|
|
: ⇓⇓⇓
|
|
: |💾 /backups/takeout/*.zip |
|
|
: ⇓⇓⇓
|
|
: HPI (my.google.takeout)
|
|
: ⇓⇓⇓
|
|
: < python interface >
|
|
|
|
The only thing you need to do is to tell it where to find the files on your disk, via [[file:MODULES.org::#mygoogletakeoutpaths][the config]], because different people use different paths for backups.
|
|
|
|
# TODO how to emphasize config?
|
|
|
|
** Reddit
|
|
|
|
Reddit has a proper API, so in theory HPI could talk directly to Reddit and retrieve the latest data. But that's not what it doing!
|
|
|
|
- first, there are excellent programmatic APIs for Reddit out there already, for example, [[https://github.com/praw-dev/praw][praw]]
|
|
- more importantly, this is the [[https://beepb00p.xyz/exports.html#design][design decision]] of HPI
|
|
|
|
It doesn't deal with all with the complexities of API interactions.
|
|
Instead, it relies on other tools to put *intermediate, raw data*, on your disk and then transforms this data into something nice.
|
|
|
|
As an example, for [[file:../my/reddit.py][Reddit]], HPI is relying on data fetched by [[https://github.com/karlicoss/rexport][rexport]] library. So the pipeline looks like:
|
|
|
|
: < 🌐 Reddit |
|
|
: ⇓⇓⇓
|
|
: { rexport/export.py (automatic, e.g. cron) }
|
|
: ⇓⇓⇓
|
|
: |💾 /backups/reddit/*.json |
|
|
: ⇓⇓⇓
|
|
: HPI (my.reddit.rexport)
|
|
: ⇓⇓⇓
|
|
: < python interface >
|
|
|
|
So, in your [[file:MODULES.org::#myreddit][reddit config]], similarly to Takeout, you need =export_path=, so HPI knows how to find your Reddit data on the disk.
|
|
|
|
But there is an extra caveat: rexport is already coming with nice [[https://github.com/karlicoss/rexport/blob/master/dal.py][data bindings]] to parse its outputs.
|
|
|
|
Several other HPI modules are following a similar pattern: hypothesis, instapaper, pinboard, kobo, etc.
|
|
|
|
Since the [[https://github.com/karlicoss/rexport#api-limitations][reddit API has limited results]], you can use [[https://github.com/seanbreckenridge/pushshift_comment_export][my.reddit.pushshift]] to access older reddit comments, which both then get merged into =my.reddit.all.comments=
|
|
|
|
** Twitter
|
|
|
|
Twitter is interesting, because it's an example of an HPI module that *arbitrates* between several data sources from the same service.
|
|
|
|
The reason to use multiple in case of Twitter is:
|
|
|
|
- there is official Twitter Archive, but it's manual, takes several days to complete and hard to automate.
|
|
- there is [[https://github.com/twintproject/twint][twint]], which can get real-time Twitter data via scraping
|
|
|
|
But Twitter has a limitation and you can't get data past 3200 tweets through API or scraping.
|
|
|
|
So the idea is to export both data sources on your disk:
|
|
|
|
: < 🌐 Twitter |
|
|
: ⇓⇓ ⇓⇓
|
|
: { manual archive download } { twint (automatic, cron) }
|
|
: ⇓⇓⇓ ⇓⇓⇓
|
|
: |💾 /backups/twitter-archives/*.zip | |💾 /backups/twint/db.sqlite |
|
|
: .............
|
|
|
|
# TODO note that the left and right parts of the diagram ('before filesystem' and 'after filesystem') are completely independent!
|
|
# if something breaks, you can still read your old data from the filesystem!
|
|
|
|
What we do next is:
|
|
|
|
1. Process raw data from twitter archives (manual export, but has all the data)
|
|
2. Process raw data from twint database (automatic export, but only recent data)
|
|
3. Merge them together, overlaying twint data on top of twitter archive data
|
|
|
|
: .............
|
|
: |💾 /backups/twitter-archives/*.zip | |💾 /backups/twint/db.sqlite |
|
|
: ⇓⇓⇓ ⇓⇓⇓
|
|
: HPI (my.twitter.archive) HPI (my.twitter.twint)
|
|
: ⇓ ⇓ ⇓ ⇓
|
|
: ⇓ HPI (my.twitter.all) ⇓
|
|
: ⇓ ⇓⇓ ⇓
|
|
: < python interface> < python interface> < python interface>
|
|
|
|
For merging the data, we're using a tiny auxiliary module, =my.twitter.all= (It's just 20 lines of code, [[file:../my/twitter/all.py][check it out]]).
|
|
|
|
Since you have two different sources of raw data, you need to specify two bits of config:
|
|
# todo link to modules thing?
|
|
|
|
: class twint:
|
|
: export_path = '/backups/twint/db.sqlite'
|
|
|
|
: class twitter_archive:
|
|
: export_path = '/backups/twitter-archives/*.zip'
|
|
|
|
Note that you can also just use =my.twitter.archive= or =my.twitter.twint= directly, or set either of paths to empty string: =''=
|
|
|
|
# #addingmodifying-modules
|
|
# Now, say you prefer to use a different library for your Twitter data instead of twint (for whatever reason), and you want to use it
|
|
# TODO docs on overlays?
|
|
|
|
** Connecting to other apps
|
|
As a user you might not be so interested in Python interface per se.. but a nice thing about having one is that it's easy to
|
|
connect the data with other apps and libraries!
|
|
|
|
: /---- 💻promnesia --- | browser extension >
|
|
: | python interface > ----+---- 💻orger --- |💾 org-mode mirror |
|
|
: +-----💻memacs --- |💾 org-mode lifelog |
|
|
: +-----💻???? --- | REST api >
|
|
: +-----💻???? --- | Datasette >
|
|
: \-----💻???? --- | Memex >
|
|
|
|
See more in [[file:../README.org::#how-do-you-use-it]["How do you use it?"]] section.
|
|
|
|
Also check out [[https://beepb00p.xyz/myinfra.html#hpi][my personal infrastructure map]] to see where I'm using HPI.
|
|
|
|
* Adding/modifying modules
|
|
# TODO link to 'overlays' documentation?
|
|
# TODO don't be afraid to TODO make sure to install in editable mode
|
|
|
|
- The easiest is just to clone HPI repository and run an editable PIP install (=pip3 install --user -e .=), or via [[#use-without-installing][with_my]] wrapper.
|
|
|
|
After that you can just edit the code directly, your changes will be reflected immediately, and you will be able to quickly iterate/fix bugs/add new methods.
|
|
|
|
This is great if you just want to add a few of your own personal modules, or make minimal changes to a few files. If you do much more than that, you may run into possible merge conflicts if/when you update (~git pull~) HPI
|
|
|
|
# TODO eh. doesn't even have to be in 'my' namespace?? need to check it
|
|
- The "proper way" (unless you want to contribute to the upstream) is to create a separate file hierarchy and add your module to =PYTHONPATH=.
|
|
|
|
# hmmm seems to be no obvious way to link to a header in a separate file,
|
|
# if you want this in both emacs and how github renders org mode
|
|
# https://github.com/karlicoss/HPI/pull/160#issuecomment-817318076
|
|
See [[file:MODULE_DESIGN.org#addingmodules][MODULE_DESIGN/adding modules]] for more information
|