HPI/doc/MODULES.org
2022-04-27 07:57:16 +01:00

393 lines
14 KiB
Org Mode

This file is an overview of *documented* modules (which I'm progressively expanding).
There are many more, see:
- [[file:../README.org::#whats-inside]["What's inside"]] for the full list of modules.
- you can also run =hpi modules= to list what's available on your system
- [[https://github.com/karlicoss/HPI][source code]] is always the primary source of truth
If you have some issues with the setup, see [[file:SETUP.org::#troubleshooting]["Troubleshooting"]].
* TOC
:PROPERTIES:
:TOC: :include all
:END:
:CONTENTS:
- [[#toc][TOC]]
- [[#intro][Intro]]
- [[#configs][Configs]]
- [[#mygoogletakeoutparser][my.google.takeout.parser]]
- [[#myhypothesis][my.hypothesis]]
- [[#myreddit][my.reddit]]
- [[#mybrowser][my.browser]]
- [[#mylocation][my.location]]
- [[#mytimetzvia_location][my.time.tz.via_location]]
- [[#mypocket][my.pocket]]
- [[#mytwittertwint][my.twitter.twint]]
- [[#mytwitterarchive][my.twitter.archive]]
- [[#mylastfm][my.lastfm]]
- [[#mypolar][my.polar]]
- [[#myinstapaper][my.instapaper]]
- [[#mygithubgdpr][my.github.gdpr]]
- [[#mygithubghexport][my.github.ghexport]]
- [[#mykobo][my.kobo]]
:END:
* Intro
See [[file:SETUP.org][SETUP]] to find out how to set up your own config.
Some explanations:
- =MY_CONFIG= is the path where you are keeping your private configuration (usually =~/.config/my/=)
- [[https://docs.python.org/3/library/pathlib.html#pathlib.Path][Path]] is a standard Python object to represent paths
- [[https://github.com/karlicoss/HPI/blob/5f4acfddeeeba18237e8b039c8f62bcaa62a4ac2/my/core/common.py#L9][PathIsh]] is a helper type to allow using either =str=, or a =Path=
- [[https://github.com/karlicoss/HPI/blob/5f4acfddeeeba18237e8b039c8f62bcaa62a4ac2/my/core/common.py#L108][Paths]] is another helper type for paths.
It's 'smart', allows you to be flexible about your config:
- simple =str= or a =Path=
- =/a/path/to/directory/=, so the module will consume all files from this directory
- a list of files/directories (it will be flattened)
- a [[https://docs.python.org/3/library/glob.html?highlight=glob#glob.glob][glob]] string, so you can be flexible about the format of your data on disk (e.g. if you want to keep it compressed)
- empty string (e.g. ~export_path = ''~), this will prevent the module from consuming any data
This can be useful for modules that merge multiple data sources (for example, =my.twitter= or =my.github=)
Typically, such variable will be passed to =get_files= to actually extract the list of real files to use. You can see usage examples [[https://github.com/karlicoss/HPI/blob/master/tests/get_files.py][here]].
- if the field has a default value, you can omit it from your private config altogether
For more thoughts on modules and their structure, see [[file:MODULE_DESIGN.org][MODULE_DESIGN]]
* all.py
Some modules have lots of different sources for data. For example,
~my.location~ (location data) has lots of possible sources -- from
~my.google.takeout.parser~, using the ~gpslogger~ android app, or through
geolocating ~my.ip~ addresses. If you only plan on using one the modules, you
can just import from the individual module, (e.g. ~my.google.takeout.parser~)
or you can disable the others using the ~core~ config -- See the
[[https://github.com/karlicoss/HPI/blob/master/doc/MODULE_DESIGN.org#allpy][MODULE_DESIGN]] docs for more details.
* Configs
The config snippets below are meant to be modified accordingly and *pasted into your private configuration*, e.g =$MY_CONFIG/my/config.py=.
You don't have to set up all modules at once, it's recommended to do it gradually, to get the feel of how HPI works.
For an extensive/complex example, you can check out ~@seanbreckenridge~'s [[https://github.com/seanbreckenridge/dotfiles/blob/master/.config/my/my/config/__init__.py][config]]
# Nested Configurations before the doc generation using the block below
** [[file:../my/reddit][my.reddit]]
Reddit data: saved items/comments/upvotes/etc.
# Note: can't be generated as easily since this is a nested configuration object
#+begin_src python
class reddit:
class rexport:
'''
Uses [[https://github.com/karlicoss/rexport][rexport]] output.
'''
# path[s]/glob to the exported JSON data
export_path: Paths
class pushshift:
'''
Uses [[https://github.com/seanbreckenridge/pushshift_comment_export][pushshift]] to get access to old comments
'''
# path[s]/glob to the exported JSON data
export_path: Paths
#+end_src
** [[file:../my/browser/][my.browser]]
Parses browser history using [[http://github.com/seanbreckenridge/browserexport][browserexport]]
#+begin_src python
class browser:
class export:
# path[s]/glob to your backed up browser history sqlite files
export_path: Paths
class active_browser:
# paths to sqlite database files which you use actively
# to read from. For example:
# from browserexport.browsers.all import Firefox
# active_databases = Firefox.locate_database()
export_path: Paths
#+end_src
** [[file:../my/location][my.location]]
Merged location history from lots of sources.
The main sources here are
[[https://github.com/mendhak/gpslogger][gpslogger]] .gpx (XML) files, and
google takeout (using =my.google.takeout.parser=), with a fallback on
manually defined home locations.
You might also be able to use [[file:../my/location/via_ip.py][my.location.via_ip]] which uses =my.ip.all= to
provide geolocation data for an IPs (though no IPs are provided from any
of the sources here). For an example of usage, see [[https://github.com/seanbreckenridge/HPI/tree/master/my/ip][here]]
#+begin_src python
class location:
home = (
# supports ISO strings
('2005-12-04' , (42.697842, 23.325973)), # Bulgaria, Sofia
# supports date/datetime objects
(date(year=1980, month=2, day=15) , (40.7128 , -74.0060 )), # NY
(datetime.fromtimestamp(1600000000, tz=timezone.utc), (55.7558 , 37.6173 )), # Moscow, Russia
)
# note: order doesn't matter, will be sorted in the data provider
class gpslogger:
# path[s]/glob to the exported gpx files
export_path: Paths
# default accuracy for gpslogger
accuracy: float = 50.0
class via_ip:
# guess ~15km accuracy for IP addresses
accuracy: float = 15_000
#+end_src
** [[file:../my/time/tz/via_location.py][my.time.tz.via_location]]
Uses the =my.location= module to determine the timezone for a location.
This can be used to 'localize' timezones. Most modules here return
datetimes in UTC, to prevent confusion whether or not its a local
timezone, one from UTC, or one in your timezone.
Depending on the specific data provider and your level of paranoia you might expect different behaviour.. E.g.:
- if your objects already have tz info, you might not need to call localize() at all
- it's safer when either all of your objects are tz aware or all are tz unware, not a mixture
- you might trust your original timezone, or it might just be UTC, and you want to use something more reasonable
#+begin_src python
TzPolicy = Literal[
'keep' , # if datetime is tz aware, just preserve it
'convert', # if datetime is tz aware, convert to provider's tz
'throw' , # if datetime is tz aware, throw exception
]
#+end_src
This is still a work in progress, plan is to integrate it with =hpi query=
so that you can easily convert/localize timezones for some module/data
#+begin_src python
class time:
class tz:
policy = 'keep'
class via_location:
# less precise, but faster
fast: bool = True
# if the accuracy for the location is more than 5km (this
# isn't an accurate location, so shouldn't use it to determine
# timezone), don't use
require_accuracy: float = 5_000
#+end_src
# TODO hmm. drawer raw means it can output outlines, but then have to manually erase the generated results. ugh.
#+begin_src python :dir .. :results output drawer raw :exports result
# TODO ugh, pkgutil.walk_packages doesn't recurse and find packages like my.twitter.archive??
# yep.. https://stackoverflow.com/q/41203765/706389
import importlib
# from lint import all_modules # meh
# TODO figure out how to discover configs automatically...
modules = [
('google' , 'my.google.takeout.parser'),
('hypothesis' , 'my.hypothesis' ),
('pocket' , 'my.pocket' ),
('twint' , 'my.twitter.twint' ),
('twitter_archive', 'my.twitter.archive' ),
('lastfm' , 'my.lastfm' ),
('polar' , 'my.polar' ),
('instapaper' , 'my.instapaper' ),
('github' , 'my.github.gdpr' ),
('github' , 'my.github.ghexport' ),
('kobo' , 'my.kobo' ),
]
def indent(s, spaces=4):
return ''.join(' ' * spaces + l for l in s.splitlines(keepends=True))
from pathlib import Path
import inspect
from dataclasses import fields
import re
print('\n') # ugh. hack for org-ruby drawers bug
for cls, p in modules:
m = importlib.import_module(p)
C = getattr(m, cls)
src = inspect.getsource(C)
i = src.find('@property')
if i != -1:
src = src[:i]
src = src.strip()
src = re.sub(r'(class \w+)\(.*', r'\1:', src)
mpath = p.replace('.', '/')
for x in ['.py', '__init__.py']:
if Path(mpath + x).exists():
mpath = mpath + x
print(f'** [[file:../{mpath}][{p}]]')
mdoc = m.__doc__
if mdoc is not None:
print(indent(mdoc))
print(f' #+begin_src python')
print(indent(src))
print(f' #+end_src')
#+end_src
#+RESULTS:
** [[file:../my/google/takeout/parser.py][my.google.takeout.parser]]
Parses Google Takeout using [[https://github.com/seanbreckenridge/google_takeout_parser][google_takeout_parser]]
See [[https://github.com/seanbreckenridge/google_takeout_parser][google_takeout_parser]] for more information about how to export and organize your takeouts
If the =DISABLE_TAKEOUT_CACHE= environment variable is set, this won't
cache individual exports in =~/.cache/google_takeout_parser=
The directory set as takeout_path can be unpacked directories, or
zip files of the exports, which are temporarily unpacked while creating
the cachew cache
#+begin_src python
class google(user_config):
# directory which includes unpacked/zipped takeouts
takeout_path: Paths
error_policy: ErrorPolicy = 'yield'
# experimental flag to use core.kompress.ZipPath
# instead of unpacking to a tmp dir via match_structure
_use_zippath: bool = False
#+end_src
** [[file:../my/hypothesis.py][my.hypothesis]]
[[https://hypothes.is][Hypothes.is]] highlights and annotations
#+begin_src python
class hypothesis:
'''
Uses [[https://github.com/karlicoss/hypexport][hypexport]] outputs
'''
# paths[s]/glob to the exported JSON data
export_path: Paths
#+end_src
** [[file:../my/pocket.py][my.pocket]]
[[https://getpocket.com][Pocket]] bookmarks and highlights
#+begin_src python
class pocket:
'''
Uses [[https://github.com/karlicoss/pockexport][pockexport]] outputs
'''
# paths[s]/glob to the exported JSON data
export_path: Paths
#+end_src
** [[file:../my/twitter/twint.py][my.twitter.twint]]
Twitter data (tweets and favorites).
Uses [[https://github.com/twintproject/twint][Twint]] data export.
Requirements: =pip3 install --user dataset=
#+begin_src python
class twint:
export_path: Paths # path[s]/glob to the twint Sqlite database
#+end_src
** [[file:../my/twitter/archive.py][my.twitter.archive]]
Twitter data (uses [[https://help.twitter.com/en/managing-your-account/how-to-download-your-twitter-archive][official twitter archive export]])
#+begin_src python
class twitter_archive:
export_path: Paths # path[s]/glob to the twitter archive takeout
#+end_src
** [[file:../my/lastfm][my.lastfm]]
Last.fm scrobbles
#+begin_src python
class lastfm:
"""
Uses [[https://github.com/karlicoss/lastfm-backup][lastfm-backup]] outputs
"""
export_path: Paths
#+end_src
** [[file:../my/polar.py][my.polar]]
[[https://github.com/burtonator/polar-bookshelf][Polar]] articles and highlights
#+begin_src python
class polar:
'''
Polar config is optional, you only need it if you want to specify custom 'polar_dir'
'''
polar_dir: PathIsh = Path('~/.polar').expanduser()
defensive: bool = True # pass False if you want it to fail faster on errors (useful for debugging)
#+end_src
** [[file:../my/instapaper.py][my.instapaper]]
[[https://www.instapaper.com][Instapaper]] bookmarks, highlights and annotations
#+begin_src python
class instapaper:
'''
Uses [[https://github.com/karlicoss/instapexport][instapexport]] outputs.
'''
# path[s]/glob to the exported JSON data
export_path : Paths
#+end_src
** [[file:../my/github/gdpr.py][my.github.gdpr]]
Github data (uses [[https://github.com/settings/admin][official GDPR export]])
#+begin_src python
class github:
gdpr_dir: PathIsh # path to unpacked GDPR archive
#+end_src
** [[file:../my/github/ghexport.py][my.github.ghexport]]
Github data: events, comments, etc. (API data)
#+begin_src python
class github:
'''
Uses [[https://github.com/karlicoss/ghexport][ghexport]] outputs.
'''
# path[s]/glob to the exported JSON data
export_path: Paths
# path to a cache directory
# if omitted, will use /tmp
cache_dir: Optional[PathIsh] = None
#+end_src
** [[file:../my/kobo.py][my.kobo]]
[[https://uk.kobobooks.com/products/kobo-aura-one][Kobo]] e-ink reader: annotations and reading stats
#+begin_src python
class kobo:
'''
Uses [[https://github.com/karlicoss/kobuddy#as-a-backup-tool][kobuddy]] outputs.
'''
# path[s]/glob to the exported databases
export_path: Paths
#+end_src