- TOC
- Few notes
- Install main HPI package
- Setting up modules
- Troubleshooting
- Usage examples
- Data flow
- Adding/modifying modules
Please don't be shy and raise issues if something in the instructions is unclear. You'd be really helping me, I want to make the setup as straightforward as possible!
Few notes
I understand that people who'd like to use this may not be super familiar with Python, PIP or generally unix, so here are some useful notes:
- only python >= 3.6 is supported
- I'm using the pip3 command, but on your system you might only have pip. If your pip --version says python 3, feel free to use pip.
- similarly, I'm using python3 in the documentation, but if your python --version says python3, it's okay to use python
- when you are using pip install, always pass --user, and never install third party packages with sudo (unless you know what you are doing)
- throughout the guide I'm assuming the user config directory is ~/.config, but it's different on Mac/Windows. See this if you're not sure what's your user config dir.
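If you're not sure whether your interpreter is new enough, a quick check along these lines can help. It's only a sketch: the helper name is_supported is made up for illustration and isn't part of HPI.

```python
import sys

# Minimum version taken from the note above
MINIMUM = (3, 6)


def is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given (major, minor, ...) version is at least 3.6."""
    return tuple(version_info[:2]) >= MINIMUM


# Report on the interpreter that is running this snippet
print(sys.version.split()[0], "supported" if is_supported() else "too old")
```

Run it with the same command you plan to use for HPI (python3 or python), since the two may point at different interpreters.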
Install main HPI package
This is a required step
You can choose one of the following options:
option 1: install from PIP
This is the easiest way:
pip3 install --user HPI
option 2: local/editable install
This is convenient if you're planning to add new modules or change the existing ones.
- Clone the repository:
git clone git@github.com:karlicoss/HPI.git /path/to/hpi
- Go into the project directory:
cd /path/to/hpi
- Run
pip3 install --user -e .
This will install the package in 'editable mode'. It means that any changes to /path/to/hpi will be immediately reflected without the need to reinstall anything. It's extremely convenient for developing and debugging.
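To confirm that an editable install really points at your checkout, you can ask Python where it would import a package from. The helper below is a generic sketch (package_locations is a hypothetical name, not an HPI function); after an editable install you would expect the result for "my" to be under /path/to/hpi.

```python
import importlib.util
from typing import List


def package_locations(name: str) -> List[str]:
    """Return the directories (or file) a top-level package would be imported from."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        # Not importable at all
        return []
    if spec.submodule_search_locations is not None:
        # Regular and namespace packages both expose their directories here
        return list(spec.submodule_search_locations)
    # Plain module: report the file it comes from, if any
    return [spec.origin] if spec.origin else []


# Demonstrated on a stdlib package so the snippet runs without HPI installed
json_locations = package_locations("json")
missing_locations = package_locations("surely_not_installed_pkg_xyz")
print(json_locations, missing_locations)
```

Calling package_locations("my") after the install is the interesting case; an empty list means the package isn't visible to this interpreter.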
option 3: use without installing
This is less convenient, but gives you more control.
- Clone the repository:
git clone git@github.com:karlicoss/HPI.git /path/to/hpi
- Go into the project directory:
cd /path/to/hpi
- Install the dependencies:
python3 setup.py --dependencies-only
- Use the with_my script to get access to my. modules. For example:
/path/to/hpi/with_my python3 -c 'from my.pinboard import bookmarks; print(list(bookmarks()))'
It's also convenient to put a symlink to with_my somewhere in your system path so you can run it from anywhere, or add an alias in your bashrc:
alias with_my='/path/to/hpi/with_my'
After that, you can wrap your command in with_my to give it access to my. modules, e.g. see examples.
The benefit of this way is that you get a bit more control, explicitly allowing your scripts to use your data.
appendix: optional packages
You can also install some optional packages:
pip3 install 'HPI[optional]'
They aren't necessary, but will improve your experience. At the moment these are:
Setting up modules
This is an optional step, as a few modules work without extra setup; it depends on the specific module.
See MODULES to read documentation on specific modules that interest you.
You might also find it interesting to read CONFIGURING, where I elaborate on some of the technical rationale behind the current configuration system.
private configuration (my.config)
If you're not planning to use private configuration (some modules don't need it) you can skip straight to the next step. Still, I'd recommend reading this section anyway.
The configuration contains paths to the data on your disks, links to external repositories, etc.
The config is simply a python package (named my.config), expected to be in ~/.config/my.
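To make "the config is a Python package" concrete, here is a self-contained sketch that builds a throwaway my/config package in a temporary directory and imports it. It only illustrates the idea: the temporary directory stands in for ~/.config/my, and the way the directory is put on sys.path here is an assumption for the demo, not a description of HPI's own loading code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Contents of a minimal config, in the same style as the examples in this guide
CONFIG_SOURCE = '''
class instapaper:
    export_path = '/data/exports/instapaper'
'''

with tempfile.TemporaryDirectory() as root:
    # Stand-in for ~/.config/my: it contains my/config/__init__.py
    package_dir = Path(root) / "my" / "config"
    package_dir.mkdir(parents=True)
    (package_dir / "__init__.py").write_text(CONFIG_SOURCE)

    sys.path.insert(0, root)
    try:
        config = importlib.import_module("my.config")
        # Settings are plain class attributes
        export_path = config.instapaper.export_path
    finally:
        sys.path.remove(root)

print(export_path)
```

Because the settings are ordinary attributes, anything Python can compute (globs, timezones, paths built from environment variables) can live in the config.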
Since it's a Python package, generally it's very flexible and there are many ways to set it up.
- The simplest way
After installing HPI, run
hpi config create
This will create an empty config file for you (usually, in ~/.config/my), which you can edit. Example configuration:

    import pytz # yes, you can use any Python stuff in the config

    class emfit:
        export_path = '/data/exports/emfit'
        tz = pytz.timezone('Europe/London')
        excluded_sids = []
        cache_path = '/tmp/emfit.cache'

    class instapaper:
        export_path = '/data/exports/instapaper'

    class roamresearch:
        export_path = '/data/exports/roamresearch'
        username = 'karlicoss'

To find out which attributes you need to specify:
- check in MODULES
- if there is nothing there, the easiest is perhaps to skim through the code of the module and to search for config. uses. For example, if you search for config. in the emfit module, you'll see that it's using export_path, tz, excluded_sids and cache_path.
- or you can just try running them and fill in the attributes Python complains about!
- Another example is in example_config:

    dir     | example_config/
    dir     | example_config/my
    dir     | example_config/my/config
    file    | example_config/my/config/__init__.py
    ---
    """
    Feel free to remove this if you don't need it/add your own custom settings and use them
    """

    class hypothesis:
        # expects outputs from https://github.com/karlicoss/hypexport
        # (it's just the standard Hypothes.is export format)
        export_path = '/path/to/hypothesis/data'
    ---
    dir     | example_config/my/config/repos
    symlink | example_config/my/config/repos/hypexport -> /tmp/my_demo/hypothesis_repo

As you can see, generally you specify fixed paths (e.g. to your backups directory) in __init__.py.
Feel free to add other files as well to organize things better; it's a real Python package after all!
Some things (e.g. links to external packages like hypexport) are specified as ordinary symlinks in the repos directory.
That way you get easy imports (e.g. import my.config.repos.hypexport.model) and proper IDE integration.
- my own config layout is a bit more complicated:

    ~/.config/my/my/config/__init__.py
    ~/.config/my/my/config/locations.py
    ~/.config/my/my/config/repos
    ~/.config/my/my/config/repos/endoexport
    ~/.config/my/my/config/repos/fbmessengerexport
    ~/.config/my/my/config/repos/kobuddy
    ~/.config/my/my/config/repos/monzoexport
    ~/.config/my/my/config/repos/pockexport
    ~/.config/my/my/config/repos/rexport

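The tip about searching a module for config. uses can be automated with a rough regular expression. The sketch below is a heuristic, not an HPI tool; the sample source is invented for the demo and is not the real emfit module.

```python
import re
from typing import List

# Matches attribute access on a name called 'config', e.g. config.export_path
CONFIG_ATTR = re.compile(r"\bconfig\.([A-Za-z_]\w*)")


def config_attributes(source: str) -> List[str]:
    """Return the sorted, de-duplicated attribute names read from 'config'."""
    return sorted(set(CONFIG_ATTR.findall(source)))


# Hypothetical module source, only loosely modelled on the emfit example above
sample = '''
def inputs():
    return get_files(config.export_path)

def timezone():
    return config.tz

CACHE = config.cache_path
skip = set(config.excluded_sids)
'''

attributes = config_attributes(sample)
print(attributes)
```

To try it on a real module, pass the text of the module's source file, e.g. config_attributes(Path('/path/to/hpi/my/emfit.py').read_text()), adjusting the path to wherever the module lives in your checkout.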
module dependencies
Dependencies are different for specific modules you're planning to use, so it's hard to specify.
Generally you can just try using the module and then install missing packages via pip3 install --user; it should be fairly straightforward.
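The "try it and install what's missing" loop can be made a bit less noisy by catching the import error and printing only the missing name. This is a generic sketch; missing_dependency is a hypothetical helper, and the name Python reports is the import name, which is usually but not always the name you pass to pip.

```python
import importlib
from typing import Optional


def missing_dependency(module_name: str) -> Optional[str]:
    """Try importing a module; return the name of the missing module, if any."""
    try:
        importlib.import_module(module_name)
    except ModuleNotFoundError as e:
        # e.name is the module Python failed to find (may be a transitive import)
        return e.name
    return None


# Demonstrated on stdlib and a made-up name so it runs without HPI installed
present = missing_dependency("json")
absent = missing_dependency("surely_not_installed_pkg_xyz")
print(present, absent)
```

For an HPI module you would call it with the module's dotted name, for example missing_dependency("my.reading.polar"), and then pip3 install --user whatever it reports.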
Troubleshooting
HPI comes with a command line tool that can help you detect potential issues. Run:
hpi doctor
# alternatively, for more output:
hpi doctor --verbose
If you only have a few modules set up, lots of them will error for you, which is expected, so check the ones you expect to work.
If you have any ideas on how to improve it, please let me know!
Here's a screenshot of how it looks when everything is mostly good: link.
Usage examples
If you run your script with the with_my wrapper, you'd have my in PYTHONPATH, which gives you access to your data from within the script.
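The effect of such a wrapper can be approximated from Python by building an environment with extra PYTHONPATH entries and handing it to a child process. This is a sketch of the general technique, not the contents of the with_my script; env_with_my and the example paths are assumptions.

```python
import os
from typing import Dict, List, Mapping


def env_with_my(paths: List[str], base: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of 'base' with 'paths' prepended to PYTHONPATH."""
    env = dict(base)
    existing = env.get("PYTHONPATH")
    # New entries go first so they take priority over anything already set
    parts = list(paths) + ([existing] if existing else [])
    env["PYTHONPATH"] = os.pathsep.join(parts)
    return env


env_fresh = env_with_my(["/path/to/hpi"], {})
env_existing = env_with_my(["/path/to/hpi"], {"PYTHONPATH": "/existing"})
print(env_fresh["PYTHONPATH"], env_existing["PYTHONPATH"])
```

A script could then be launched with subprocess.run([sys.executable, "script.py"], env=env_with_my(["/path/to/hpi"], os.environ)), which keeps the extra path scoped to that one command.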
End-to-end Roam Research setup
Polar
Polar doesn't require any setup as it accesses the highlights on your filesystem (usually in ~/.polar).
You can check whether it works with:
python3 -c 'import my.reading.polar as polar; print(polar.get_entries())'
Google Takeout
If you have zip Google Takeout archives, you can use HPI to access them:
- prepare the config ~/.config/my/my/config.py

    class google:
        # you can pass the directory, a glob, or a single zip file
        takeout_path = '/backups/takeouts/*.zip'

- use it:

    $ python3 -c 'import my.media.youtube as yt; print(yt.get_watched()[-1])'
    Watched(url='https://www.youtube.com/watch?v=p0t0J_ERzHM', title='Monster magnet meets monster magnet...', when=datetime.datetime(2020, 1, 22, 20, 34, tzinfo=<UTC>))

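The config comment says takeout_path may be a directory, a glob, or a single zip file. One way such a setting could be expanded into a list of archives is sketched below; resolve_inputs is a hypothetical helper written for this guide, and HPI's own handling may differ (for instance in how it treats directories).

```python
import tempfile
from glob import glob
from pathlib import Path
from typing import List


def resolve_inputs(spec: str) -> List[Path]:
    """Expand a directory, a glob pattern, or a single file into sorted paths."""
    path = Path(spec)
    if path.is_dir():
        # Assumption: archives sit directly inside the directory
        return sorted(path.glob("*.zip"))
    # A plain file path behaves like a glob that matches only itself
    return sorted(Path(p) for p in glob(spec))


with tempfile.TemporaryDirectory() as tmp:
    for name in ("takeout-20190901.zip", "takeout-20200301.zip", "notes.txt"):
        (Path(tmp) / name).touch()

    from_dir = [p.name for p in resolve_inputs(tmp)]
    from_glob = [p.name for p in resolve_inputs(str(Path(tmp) / "*.zip"))]
    from_file = [p.name for p in resolve_inputs(str(Path(tmp) / "takeout-20200301.zip"))]

print(from_dir, from_glob, from_file)
```

Sorting by name works as a chronological order here only because the archive names embed the date, as in the examples in this guide.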
Kobo reader
The Kobo module allows you to access the books you've read along with the highlights and notes. It uses exports provided by the kobuddy package.
- prepare the config
  - Point the kobuddy symlink in your config at your local kobuddy checkout:
    ln -sfT /path/to/kobuddy ~/.config/my/my/config/repos/kobuddy
  - Add kobo config to ~/.config/my/my/config/__init__.py

        class kobo:
            export_dir = '/backups/to/kobo/'

After that you should be able to use it:
python3 -c 'import my.books.kobo as kobo; print(kobo.get_highlights())'
Orger
demo.py
Read/run demo.py for a full demonstration of setting up Hypothesis (it uses annotations data from a public GitHub repository).
Data flow
Here, I'll demonstrate how data flows into and from HPI using several examples, starting from the simplest and moving on to more complicated ones.
If you want to see how it looks as a whole, check out my infrastructure map!
Polar Bookshelf
Polar keeps the data:
- locally, on your disk
- in ~/.polar,
- as a bunch of JSON files
It's excellent from all perspectives, except one: you can only meaningfully use it through the Polar app. Which is, by all means, great!
But you might want to integrate your data elsewhere and use it in ways that the Polar developers never even anticipated!
If you check the data layout (example), you can see it's messy: scattered across multiple directories, contains raw HTML, obscure entities, etc. It's understandable from the app developer's perspective, but it makes things frustrating when you want to work with this data.
Here comes the HPI polar module!

    |💾 ~/.polar (raw JSON data) |
                ⇓⇓⇓
        HPI (my.reading.polar)
                ⇓⇓⇓
        < python interface >

So the data is read from the |💾 filesystem |, processed/normalized with HPI, which results in a nice programmatic < interface > for Polar data.
Note that it doesn't require any extra configuration – it "just" works because the data is kept locally in the known location.
Google Takeout
Google Takeout exports are, unfortunately, manual (or semi-manual if you do some voodoo with mounting Google Drive). Anyway, say you're doing it once every six months, so you end up with several archives on your disk:

    /backups/takeout/takeout-20151201.zip
    ....
    /backups/takeout/takeout-20190901.zip
    /backups/takeout/takeout-20200301.zip

Inside the archives... there is a bunch of random files from all your Google services. Lately, many of them are JSON, but, for example, in 2015 most of them were HTML! It's a nightmare to work with, even when you're an experienced programmer.
Of course, HPI helps you here by encapsulating all this parsing logic and exposing Python interfaces instead.

          < 🌐 Google |
               ⇓⇓⇓
        { manual download }
               ⇓⇓⇓
    |💾 /backups/takeout/*.zip |
               ⇓⇓⇓
      HPI (my.google.takeout)
               ⇓⇓⇓
        < python interface >

The only thing you need to do is to tell it where to find the files on your disk, via the config, because different people use different paths for backups.
Reddit
Reddit has a proper API, so in theory HPI could talk directly to Reddit and retrieve the latest data. But that's not what it's doing!
- first, there are excellent programmatic APIs for Reddit out there already, for example, praw
- more importantly, this is a design decision of HPI. It doesn't deal with all the complexities of API interactions. Instead, it relies on other tools to put intermediate, raw data on your disk, and then transforms this data into something nice.
As an example, for Reddit, HPI relies on data fetched by the rexport library. So the pipeline looks like:

                   < 🌐 Reddit |
                        ⇓⇓⇓
    { rexport/export.py (automatic, e.g. cron) }
                        ⇓⇓⇓
             |💾 /backups/reddit/*.json |
                        ⇓⇓⇓
                 HPI (my.reddit)
                        ⇓⇓⇓
                < python interface >

So, in your reddit config, similarly to Takeout, you need export_path, so HPI knows how to find your Reddit data on the disk.
But there is an extra caveat: rexport already comes with nice data bindings to parse its outputs.
Another design decision of HPI is to use existing code and libraries as much as possible, so we also specify a path to the rexport repository in the config.
(note: in the future it's possible that rexport will be installed via PIP, I just haven't had time for it so far).
Several other HPI modules are following a similar pattern: hypothesis, instapaper, pinboard, kobo, etc.
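The "raw files on disk, nice interface on top" pattern can be sketched in a few lines. The file format below is an assumption made purely for illustration (each export file is a JSON list of objects with an "id" key); real rexport output and HPI's my.reddit module are richer and may be structured differently.

```python
import json
import tempfile
from glob import glob
from pathlib import Path
from typing import Dict, Iterator


def iter_items(export_path: str) -> Iterator[Dict]:
    """Yield every item once, reading export files in name order."""
    seen = set()
    for path in sorted(glob(export_path)):
        with open(path) as f:
            for item in json.load(f):
                # Periodic exports overlap, so skip items already emitted
                if item["id"] not in seen:
                    seen.add(item["id"])
                    yield item


with tempfile.TemporaryDirectory() as tmp:
    # Two overlapping hypothetical exports, as a cron job might produce
    (Path(tmp) / "export-2020-01.json").write_text(json.dumps([{"id": "a"}, {"id": "b"}]))
    (Path(tmp) / "export-2020-02.json").write_text(json.dumps([{"id": "b"}, {"id": "c"}]))
    ids = [item["id"] for item in iter_items(str(Path(tmp) / "*.json"))]

print(ids)
```

The only input is the export_path glob, which is why that is the one setting such a module needs from the config.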
Twitter
Twitter is interesting, because it's an example of an HPI module that arbitrates between several data sources from the same service.
The reason to use multiple sources in the case of Twitter is:
- there is the official Twitter Archive, but it's manual, takes several days to complete, and is hard to automate.
- there is twint, which can get real-time Twitter data via scraping. But Twitter has a limitation: you can't get data past 3200 tweets through the API or scraping.
So the idea is to export both data sources to your disk:

                                < 🌐 Twitter |
                          ⇓⇓                   ⇓⇓
        { manual archive download }     { twint (automatic, cron) }
                   ⇓⇓⇓                             ⇓⇓⇓
    |💾 /backups/twitter-archives/*.zip |   |💾 /backups/twint/db.sqlite |
                                .............

What we do next is:
- Process raw data from twitter archives (manual export, but has all the data)
- Process raw data from twint database (automatic export, but only recent data)
- Merge them together, overlaying twint data on top of twitter archive data

                                  .............
    |💾 /backups/twitter-archives/*.zip |             |💾 /backups/twint/db.sqlite |
                     ⇓⇓⇓                                           ⇓⇓⇓
        HPI (my.twitter.archive)                              HPI (my.twitter.twint)
         ⇓                      ⇓                            ⇓                     ⇓
         ⇓                          HPI (my.twitter.all)                           ⇓
         ⇓                                  ⇓⇓                                     ⇓
    < python interface>           < python interface>                  < python interface>

For merging the data, we're using a tiny auxiliary module, my.twitter.all (it's just 20 lines of code, check it out).
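The overlay step described above can be illustrated with a small, self-contained sketch. It is not the code of my.twitter.all: the Tweet record and merge_tweets function are hypothetical, and tweets are assumed to be identified by an id.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class Tweet(NamedTuple):
    # Hypothetical minimal record; the real modules expose richer objects
    id: str
    text: str
    source: str


def merge_tweets(archive: Iterable[Tweet], twint: Iterable[Tweet]) -> Iterator[Tweet]:
    """Yield one tweet per id, with twint entries replacing archive entries."""
    merged: Dict[str, Tweet] = {}
    for tweet in archive:
        merged[tweet.id] = tweet
    for tweet in twint:
        # Overlay: the twint version wins when both sources have the tweet
        merged[tweet.id] = tweet
    yield from merged.values()


archive = [Tweet("1", "old tweet", "archive"), Tweet("2", "overlap", "archive")]
twint = [Tweet("2", "overlap", "twint"), Tweet("3", "recent tweet", "twint")]
merged = list(merge_tweets(archive, twint))
print([(t.id, t.source) for t in merged])
```

The archive supplies the full history and twint fills in whatever was posted since the last manual export, so downstream code sees a single stream.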
Since you have two different sources of raw data, you need to specify two bits of config:

    class twint:
        export_path = '/backups/twint/db.sqlite'

    class twitter_archive:
        export_path = '/backups/twitter-archives/*.zip'

Note that you can also just use my.twitter.archive or my.twitter.twint directly, or set either of the paths to an empty string: ''
Connecting to other apps
As a user you might not be so interested in the Python interface per se... but a nice thing about having one is that it's easy to connect the data with other apps and libraries!

                             /---- 💻promnesia --- | browser extension  >
    | python interface > ----+---- 💻orger     --- |💾 org-mode mirror  |
                             +-----💻memacs    --- |💾 org-mode lifelog |
                             +-----💻????      --- | REST api           >
                             +-----💻????      --- | Datasette          >
                             \-----💻????      --- | Memex              >

See more in "How do you use it?" section.
Adding/modifying modules
The easiest way is just to run HPI via the with_my wrapper or with an editable PIP install. That way your changes will be reflected immediately, and you will be able to quickly iterate/fix bugs/add new methods.
The "proper way" (unless you want to contribute to the upstream) is to create a separate file hierarchy and add your module to PYTHONPATH.
For example, if you want to add an awesomedatasource, it could be:

    custom_module
    └── my
        └── awesomedatasource.py

You can use all existing HPI modules in awesomedatasource.py, for example, my.config, or everything from my.core.
But also, you can override the builtin HPI modules too:

    custom_reddit_overlay
    └── my
        └── reddit.py

Now if you add custom_reddit_overlay to the front of PYTHONPATH, all the downstream scripts using my.reddit will load it from custom_reddit_overlay instead.
This could be useful to monkey patch some behaviours, or dynamically add some extra data sources – anything that comes to your mind.
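The overlay behaviour can be tried out safely with two throwaway directory trees. The sketch below assumes that neither tree has an __init__.py inside my (so my is a namespace package) and uses a stub named hpi_stub in place of the real HPI checkout; both the stub and the SOURCE marker are inventions for this demo.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List


def make_tree(root: Path, name: str, marker: str) -> Path:
    """Create <root>/<name>/my/reddit.py that records which tree it came from."""
    package = root / name / "my"
    package.mkdir(parents=True)
    # No __init__.py on purpose: 'my' is then a namespace package
    (package / "reddit.py").write_text(f"SOURCE = {marker!r}\n")
    return root / name


def loaded_source(pythonpath: List[Path], cwd: str) -> str:
    """Import my.reddit in a fresh interpreter using the given PYTHONPATH order."""
    env = dict(os.environ, PYTHONPATH=os.pathsep.join(str(p) for p in pythonpath))
    result = subprocess.run(
        [sys.executable, "-c", "import my.reddit; print(my.reddit.SOURCE)"],
        env=env, cwd=cwd, check=True,
        stdout=subprocess.PIPE, universal_newlines=True,
    )
    return result.stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    upstream = make_tree(Path(tmp), "hpi_stub", "upstream")
    overlay = make_tree(Path(tmp), "custom_reddit_overlay", "overlay")
    overlay_first = loaded_source([overlay, upstream], cwd=tmp)
    upstream_first = loaded_source([upstream, overlay], cwd=tmp)

print(overlay_first, upstream_first)
```

Whichever tree comes first in PYTHONPATH provides my.reddit, while modules that exist only in the other tree stay importable. Note that this relies on my being a namespace package everywhere on the path; a regular package (one with an __init__.py) would take over the whole my namespace.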
I'll put up a better guide on this, in the meantime see "namespace packages" for more info.