This file is an overview of *documented* modules (which I'm progressively expanding).

There are many more, see:

- [[file:../README.org::#whats-inside]["What's inside"]] for the full list of modules.
- you can also run =hpi modules= to list what's available on your system
- [[https://github.com/karlicoss/HPI][source code]] is always the primary source of truth

If you have some issues with the setup, see [[file:SETUP.org::#troubleshooting]["Troubleshooting"]].

* TOC
:PROPERTIES:
:TOC: :include all
:END:
:CONTENTS:
- [[#toc][TOC]]
- [[#intro][Intro]]
- [[#configs][Configs]]
  - [[#mygoogletakeoutparser][my.google.takeout.parser]]
  - [[#myhypothesis][my.hypothesis]]
  - [[#myreddit][my.reddit]]
  - [[#mybrowser][my.browser]]
  - [[#mylocation][my.location]]
  - [[#mytimetzvia_location][my.time.tz.via_location]]
  - [[#mypocket][my.pocket]]
  - [[#mytwittertwint][my.twitter.twint]]
  - [[#mytwitterarchive][my.twitter.archive]]
  - [[#mylastfm][my.lastfm]]
  - [[#mypolar][my.polar]]
  - [[#myinstapaper][my.instapaper]]
  - [[#mygithubgdpr][my.github.gdpr]]
  - [[#mygithubghexport][my.github.ghexport]]
  - [[#mykobo][my.kobo]]
:END:

* Intro

See [[file:SETUP.org][SETUP]] to find out how to set up your own config.

Some explanations:

- =MY_CONFIG= is the path where you are keeping your private configuration (usually =~/.config/my/=)
- [[https://docs.python.org/3/library/pathlib.html#pathlib.Path][Path]] is a standard Python object to represent paths
- [[https://github.com/karlicoss/HPI/blob/5f4acfddeeeba18237e8b039c8f62bcaa62a4ac2/my/core/common.py#L9][PathIsh]] is a helper type to allow using either =str=, or a =Path=
- [[https://github.com/karlicoss/HPI/blob/5f4acfddeeeba18237e8b039c8f62bcaa62a4ac2/my/core/common.py#L108][Paths]] is another helper type for paths.

  It's 'smart', allowing you to be flexible about your config. It accepts:

  - simple =str= or a =Path=
  - =/a/path/to/directory/=, so the module will consume all files from this directory
  - a list of files/directories (it will be flattened)
  - a [[https://docs.python.org/3/library/glob.html?highlight=glob#glob.glob][glob]] string, so you can be flexible about the format of your data on disk (e.g. if you want to keep it compressed)
  - empty string (e.g. ~export_path = ''~), this will prevent the module from consuming any data

    This can be useful for modules that merge multiple data sources (for example, =my.twitter= or =my.github=)

  Typically, such a variable will be passed to =get_files= to actually extract the list of real files to use. You can see usage examples [[https://github.com/karlicoss/HPI/blob/master/tests/get_files.py][here]].

- if the field has a default value, you can omit it from your private config altogether
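
To make the accepted forms of =Paths= more concrete, here is a minimal sketch of how such a value could be expanded into a flat list of files. It is only an illustration: =expand_paths= is a hypothetical helper written for this example, not HPI's actual =get_files=, which may differ in details (for instance in how it treats missing paths).

#+begin_src python
import glob
from pathlib import Path
from typing import List, Sequence, Union

PathIsh = Union[str, Path]
Paths = Union[PathIsh, Sequence[PathIsh]]


def expand_paths(paths: Paths) -> List[Path]:
    """Hypothetical helper: turn a Paths-like value into a sorted list of paths."""
    if paths == '':
        # empty string: the module should not consume any data
        return []
    if isinstance(paths, (str, Path)):
        # a single str/Path is treated as a one-element list
        paths = [paths]
    result: List[Path] = []
    for p in paths:
        s = str(p)
        if any(ch in s for ch in '*?['):
            # glob string: let the glob module expand it
            result.extend(Path(x) for x in glob.glob(s, recursive=True))
        elif Path(s).is_dir():
            # directory: consume all files directly inside it
            result.extend(x for x in Path(s).iterdir() if x.is_file())
        else:
            # plain file path (assumption: kept as is, even if it doesn't exist)
            result.append(Path(s))
    return sorted(result)
#+end_src

With this sketch, =expand_paths('/data/exports/*.json')= and =expand_paths(['/data/exports/'])= would both end up as a flat list of =Path= objects (the directory name is made up for the example).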

For more thoughts on modules and their structure, see [[file:MODULE_DESIGN.org][MODULE_DESIGN]]

* all.py

Some modules have lots of different sources for data. For example,
~my.location~ (location data) has lots of possible sources -- from
~my.google.takeout.parser~, using the ~gpslogger~ Android app, or through
geolocating ~my.ip~ addresses. If you only plan on using one of the modules, you
can just import from the individual module (e.g. ~my.google.takeout.parser~),
or you can disable the others using the ~core~ config. See the
[[https://github.com/karlicoss/HPI/blob/master/doc/MODULE_DESIGN.org#allpy][MODULE_DESIGN]] docs for more details.
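
As a rough illustration of the second option, the fragment below sketches what disabling some location sources in your private config might look like. It assumes the ~core~ config exposes a =disabled_modules= list and that the listed module names match your installation; treat MODULE_DESIGN as the authoritative description.

#+begin_src python
class core:
    # assumption: 'disabled_modules' is the attribute read by the core config;
    # the module names below are examples, adjust them to the sources you don't use
    disabled_modules = [
        'my.location.gpslogger',
        'my.location.via_ip',
    ]
#+end_src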

* Configs

The config snippets below are meant to be modified accordingly and *pasted into your private configuration*, e.g. =$MY_CONFIG/my/config.py=.

You don't have to set up all modules at once; it's recommended to do it gradually, to get a feel for how HPI works.

For an extensive/complex example, you can check out ~@seanbreckenridge~'s [[https://github.com/seanbreckenridge/dotfiles/blob/master/.config/my/my/config/__init__.py][config]]

# Nested Configurations before the doc generation using the block below
** [[file:../my/reddit][my.reddit]]

Reddit data: saved items/comments/upvotes/etc.

# Note: can't be generated as easily since this is a nested configuration object
#+begin_src python
class reddit:
    class rexport:
        '''
        Uses [[https://github.com/karlicoss/rexport][rexport]] output.
        '''

        # path[s]/glob to the exported JSON data
        export_path: Paths

    class pushshift:
        '''
        Uses [[https://github.com/seanbreckenridge/pushshift_comment_export][pushshift]] to get access to old comments
        '''

        # path[s]/glob to the exported JSON data
        export_path: Paths
#+end_src
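
For illustration, a filled-in version of this nested config could look like the fragment below. The paths are hypothetical placeholders; any of the =Paths= forms described in the Intro should work in their place.

#+begin_src python
class reddit:
    class rexport:
        # a glob string: picks up every export in this (hypothetical) directory
        export_path = '/data/exports/rexport/*.json'

    class pushshift:
        # a list of files/directories is accepted as well (hypothetical path)
        export_path = ['/data/exports/pushshift/']
#+end_src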

** [[file:../my/browser/][my.browser]]

Parses browser history using [[http://github.com/seanbreckenridge/browserexport][browserexport]]

#+begin_src python
class browser:
    class export:
        # path[s]/glob to your backed up browser history sqlite files
        export_path: Paths

    class active_browser:
        # paths to sqlite database files which you use actively
        # to read from. For example:
        # from browserexport.browsers.all import Firefox
        # active_databases = Firefox.locate_database()
        export_path: Paths
#+end_src
** [[file:../my/location][my.location]]

Merged location history from lots of sources.

The main sources here are
[[https://github.com/mendhak/gpslogger][gpslogger]] .gpx (XML) files, and
google takeout (using =my.google.takeout.parser=), with a fallback on
manually defined home locations.

You might also be able to use [[file:../my/location/via_ip.py][my.location.via_ip]] which uses =my.ip.all= to
provide geolocation data for IPs (though no IPs are provided from any
of the sources here). For an example of usage, see [[https://github.com/seanbreckenridge/HPI/tree/master/my/ip][here]]

#+begin_src python
from datetime import date, datetime, timezone

class location:
    home = (
        # supports ISO strings
        ('2005-12-04'                                       , (42.697842, 23.325973)), # Bulgaria, Sofia
        # supports date/datetime objects
        (date(year=1980, month=2, day=15)                   , (40.7128  , -74.0060 )), # NY
        (datetime.fromtimestamp(1600000000, tz=timezone.utc), (55.7558  , 37.6173  )), # Moscow, Russia
    )
    # note: order doesn't matter, will be sorted in the data provider

    class gpslogger:
        # path[s]/glob to the exported gpx files
        export_path: Paths

        # default accuracy for gpslogger
        accuracy: float = 50.0

    class via_ip:
        # guess ~15km accuracy for IP addresses
        accuracy: float = 15_000
#+end_src
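
The =home= entries mix ISO strings, =date= and =datetime= objects, and are sorted by the data provider. The sketch below shows one way such a history could be normalized and queried. The names =normalize= and =home_at= are hypothetical and not taken from the HPI source, and treating naive values as UTC is an assumption made for this example.

#+begin_src python
from bisect import bisect_right
from datetime import date, datetime, time, timezone
from typing import Iterable, Tuple, Union

LatLon = Tuple[float, float]
DateIsh = Union[str, date, datetime]


def normalize(d: DateIsh) -> datetime:
    """Convert an ISO string / date / datetime into a tz-aware datetime."""
    if isinstance(d, str):
        d = datetime.fromisoformat(d)
    # note: datetime is a subclass of date, so it has to be checked first
    if isinstance(d, datetime):
        # assumption: naive datetimes are interpreted as UTC
        return d if d.tzinfo is not None else d.replace(tzinfo=timezone.utc)
    # plain date: use midnight UTC
    return datetime.combine(d, time.min, tzinfo=timezone.utc)


def home_at(history: Iterable[Tuple[DateIsh, LatLon]], when: datetime) -> LatLon:
    """Return the latest home location that starts at or before 'when' (tz-aware)."""
    entries = sorted((normalize(d), loc) for d, loc in history)
    dates = [d for d, _ in entries]
    idx = bisect_right(dates, when) - 1
    # before the first known entry, fall back to the earliest one
    return entries[max(idx, 0)][1]
#+end_src

Because the entries are sorted inside =home_at=, the order in which they appear in the config does not matter, which matches the note in the config above.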
** [[file:../my/time/tz/via_location.py][my.time.tz.via_location]]

Uses the =my.location= module to determine the timezone for a location.

This can be used to 'localize' timezones. Most modules here return
datetimes in UTC, to prevent confusion about whether a datetime is in a local
timezone, in UTC, or in your own timezone.

Depending on the specific data provider and your level of paranoia you might expect different behaviour. E.g.:
- if your objects already have tz info, you might not need to call localize() at all
- it's safer when either all of your objects are tz aware or all are tz unaware, not a mixture
- you might trust your original timezone, or it might just be UTC, and you want to use something more reasonable

#+begin_src python
from typing import Literal

TzPolicy = Literal[
    'keep'   , # if datetime is tz aware, just preserve it
    'convert', # if datetime is tz aware, convert to provider's tz
    'throw'  , # if datetime is tz aware, throw exception
]
#+end_src
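
To show how the three policies differ in practice, here is a small sketch of a function applying them to a single datetime. =apply_policy= is a hypothetical helper, not the module's real API, and the handling of naive datetimes (attaching the provider's timezone) is an assumption for the sake of the example.

#+begin_src python
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Literal

TzPolicy = Literal['keep', 'convert', 'throw']


def apply_policy(dt: datetime, provider_tz: tzinfo, policy: TzPolicy) -> datetime:
    """Hypothetical helper: localize 'dt' to 'provider_tz' according to 'policy'."""
    if dt.tzinfo is None:
        # assumption: a naive datetime holds local wall-clock time,
        # so the provider's timezone is simply attached to it
        return dt.replace(tzinfo=provider_tz)
    if policy == 'keep':
        # tz aware: preserve as is
        return dt
    if policy == 'convert':
        # tz aware: same instant, expressed in the provider's timezone
        return dt.astimezone(provider_tz)
    if policy == 'throw':
        raise RuntimeError(f'{dt} already has tz info')
    raise ValueError(f'unexpected policy: {policy}')


# usage: 12:00 UTC converted to a UTC+2 provider timezone becomes 14:00
provider = timezone(timedelta(hours=2))
noon_utc = datetime(2020, 1, 1, 12, tzinfo=timezone.utc)
converted = apply_policy(noon_utc, provider, 'convert')
#+end_src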

Timezone localization is still a work in progress; the plan is to integrate it with =hpi query=
so that you can easily convert/localize timezones for some module/data.

#+begin_src python
class time:
    class tz:
        policy = 'keep'

        class via_location:
            # less precise, but faster
            fast: bool = True

            # if the accuracy for the location is more than 5km, don't use it:
            # such a location is too imprecise to determine the timezone
            require_accuracy: float = 5_000
#+end_src
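
The effect of =require_accuracy= can be pictured as a filter over location points. The sketch below assumes that accuracy is measured in metres (so =5_000= corresponds to the 5km mentioned above) and that points with unknown accuracy are kept; the =Location= class and =accurate_enough= are hypothetical names, not the module's actual types.

#+begin_src python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Location:
    lat: float
    lon: float
    accuracy: Optional[float]  # assumed to be in metres; None if unknown


def accurate_enough(
    locations: Iterable[Location],
    require_accuracy: float = 5_000,
) -> Iterator[Location]:
    """Yield only the locations precise enough to determine a timezone."""
    for loc in locations:
        if loc.accuracy is not None and loc.accuracy > require_accuracy:
            # too imprecise, skip
            continue
        yield loc
#+end_src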

# TODO hmm. drawer raw means it can output outlines, but then have to manually erase the generated results. ugh.

#+begin_src python :dir .. :results output drawer raw :exports result
# TODO ugh, pkgutil.walk_packages doesn't recurse and find packages like my.twitter.archive??
# yep.. https://stackoverflow.com/q/41203765/706389
import importlib
# from lint import all_modules # meh
# TODO figure out how to discover configs automatically...
modules = [
    ('google'         , 'my.google.takeout.parser'),
    ('hypothesis'     , 'my.hypothesis'           ),
    ('pocket'         , 'my.pocket'               ),
    ('twint'          , 'my.twitter.twint'        ),
    ('twitter_archive', 'my.twitter.archive'      ),
    ('lastfm'         , 'my.lastfm'               ),
    ('polar'          , 'my.polar'                ),
    ('instapaper'     , 'my.instapaper'           ),
    ('github'         , 'my.github.gdpr'          ),
    ('github'         , 'my.github.ghexport'      ),
    ('kobo'           , 'my.kobo'                 ),
]

def indent(s, spaces=4):
    return ''.join(' ' * spaces + l for l in s.splitlines(keepends=True))

from pathlib import Path
import inspect
from dataclasses import fields
import re
print('\n') # ugh. hack for org-ruby drawers bug
for cls, p in modules:
    m = importlib.import_module(p)
    C = getattr(m, cls)
    src = inspect.getsource(C)
    i = src.find('@property')
    if i != -1:
        src = src[:i]
    src = src.strip()
    src = re.sub(r'(class \w+)\(.*', r'\1:', src)
    mpath = p.replace('.', '/')
    for x in ['.py', '__init__.py']:
        if Path(mpath + x).exists():
            mpath = mpath + x
    print(f'** [[file:../{mpath}][{p}]]')
    mdoc = m.__doc__
    if mdoc is not None:
        print(indent(mdoc))
    print(f' #+begin_src python')
    print(indent(src))
    print(f' #+end_src')
#+end_src

#+RESULTS:

** [[file:../my/google/takeout/parser.py][my.google.takeout.parser]]

Parses Google Takeout using [[https://github.com/seanbreckenridge/google_takeout_parser][google_takeout_parser]]

See [[https://github.com/seanbreckenridge/google_takeout_parser][google_takeout_parser]] for more information about how to export and organize your takeouts

If the =DISABLE_TAKEOUT_CACHE= environment variable is set, this won't
cache individual exports in =~/.cache/google_takeout_parser=

The path set as =takeout_path= can contain unpacked directories or
zip files of the exports; zip files are temporarily unpacked while creating
the cachew cache

#+begin_src python
class google(user_config):
    # directory which includes unpacked/zipped takeouts
    takeout_path: Paths

    error_policy: ErrorPolicy = 'yield'

    # experimental flag to use core.kompress.ZipPath
    # instead of unpacking to a tmp dir via match_structure
    _use_zippath: bool = False
#+end_src

** [[file:../my/hypothesis.py][my.hypothesis]]

[[https://hypothes.is][Hypothes.is]] highlights and annotations

#+begin_src python
class hypothesis:
    '''
    Uses [[https://github.com/karlicoss/hypexport][hypexport]] outputs
    '''

    # path[s]/glob to the exported JSON data
    export_path: Paths
#+end_src

** [[file:../my/pocket.py][my.pocket]]

[[https://getpocket.com][Pocket]] bookmarks and highlights

#+begin_src python
class pocket:
    '''
    Uses [[https://github.com/karlicoss/pockexport][pockexport]] outputs
    '''

    # path[s]/glob to the exported JSON data
    export_path: Paths
#+end_src

** [[file:../my/twitter/twint.py][my.twitter.twint]]

Twitter data (tweets and favorites).

Uses [[https://github.com/twintproject/twint][Twint]] data export.

Requirements: =pip3 install --user dataset=

#+begin_src python
class twint:
    export_path: Paths # path[s]/glob to the twint Sqlite database
#+end_src

** [[file:../my/twitter/archive.py][my.twitter.archive]]

Twitter data (uses [[https://help.twitter.com/en/managing-your-account/how-to-download-your-twitter-archive][official twitter archive export]])

#+begin_src python
class twitter_archive:
    export_path: Paths # path[s]/glob to the twitter archive takeout
#+end_src

** [[file:../my/lastfm][my.lastfm]]

Last.fm scrobbles

#+begin_src python
class lastfm:
    """
    Uses [[https://github.com/karlicoss/lastfm-backup][lastfm-backup]] outputs
    """
    export_path: Paths
#+end_src

** [[file:../my/polar.py][my.polar]]

[[https://github.com/burtonator/polar-bookshelf][Polar]] articles and highlights

#+begin_src python
class polar:
    '''
    Polar config is optional, you only need it if you want to specify custom 'polar_dir'
    '''
    polar_dir: PathIsh = Path('~/.polar').expanduser()
    defensive: bool = True # pass False if you want it to fail faster on errors (useful for debugging)
#+end_src

** [[file:../my/instapaper.py][my.instapaper]]

[[https://www.instapaper.com][Instapaper]] bookmarks, highlights and annotations

#+begin_src python
class instapaper:
    '''
    Uses [[https://github.com/karlicoss/instapexport][instapexport]] outputs.
    '''
    # path[s]/glob to the exported JSON data
    export_path: Paths
#+end_src

** [[file:../my/github/gdpr.py][my.github.gdpr]]

GitHub data (uses [[https://github.com/settings/admin][official GDPR export]])

#+begin_src python
class github:
    gdpr_dir: PathIsh # path to unpacked GDPR archive
#+end_src

** [[file:../my/github/ghexport.py][my.github.ghexport]]

GitHub data: events, comments, etc. (API data)

#+begin_src python
class github:
    '''
    Uses [[https://github.com/karlicoss/ghexport][ghexport]] outputs.
    '''
    # path[s]/glob to the exported JSON data
    export_path: Paths

    # path to a cache directory
    # if omitted, will use /tmp
    cache_dir: Optional[PathIsh] = None
#+end_src
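
The optional =cache_dir= follows a common pattern: use the configured directory if there is one, otherwise fall back to the system temporary directory. A minimal sketch of that fallback, assuming =tempfile.gettempdir()= as the stand-in for =/tmp= (the helper =resolve_cache_dir= is hypothetical and not part of the module):

#+begin_src python
import tempfile
from pathlib import Path
from typing import Optional, Union

PathIsh = Union[str, Path]


def resolve_cache_dir(cache_dir: Optional[PathIsh] = None) -> Path:
    """Hypothetical helper: pick the directory to use for caching."""
    if cache_dir is None:
        # nothing configured: fall back to the system temp dir (usually /tmp on Linux)
        return Path(tempfile.gettempdir())
    # accept both str and Path, and expand a leading '~'
    return Path(cache_dir).expanduser()
#+end_src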
** [[file:../my/kobo.py][my.kobo]]

[[https://uk.kobobooks.com/products/kobo-aura-one][Kobo]] e-ink reader: annotations and reading stats

#+begin_src python
class kobo:
    '''
    Uses [[https://github.com/karlicoss/kobuddy#as-a-backup-tool][kobuddy]] outputs.
    '''
    # path[s]/glob to the exported databases
    export_path: Paths
#+end_src