This file is an overview of *documented* modules (which I'm progressively expanding). There are many more, see: - [[file:../README.org::#whats-inside]["What's inside"]] for the full list of modules. - you can also run =hpi modules= to list what's available on your system - [[https://github.com/karlicoss/HPI][source code]] is always the primary source of truth If you have some issues with the setup, see [[file:SETUP.org::#troubleshooting]["Troubleshooting"]]. * TOC :PROPERTIES: :TOC: :include all :END: :CONTENTS: - [[#toc][TOC]] - [[#intro][Intro]] - [[#configs][Configs]] - [[#mygoogletakeoutparser][my.google.takeout.parser]] - [[#myhypothesis][my.hypothesis]] - [[#myreddit][my.reddit]] - [[#mybrowser][my.browser]] - [[#mylocation][my.location]] - [[#mytimetzvia_location][my.time.tz.via_location]] - [[#mypocket][my.pocket]] - [[#mytwittertwint][my.twitter.twint]] - [[#mytwitterarchive][my.twitter.archive]] - [[#mylastfm][my.lastfm]] - [[#mypolar][my.polar]] - [[#myinstapaper][my.instapaper]] - [[#mygithubgdpr][my.github.gdpr]] - [[#mygithubghexport][my.github.ghexport]] - [[#mykobo][my.kobo]] :END: * Intro See [[file:SETUP.org][SETUP]] to find out how to set up your own config. Some explanations: - =MY_CONFIG= is the path where you are keeping your private configuration (usually =~/.config/my/=) - [[https://docs.python.org/3/library/pathlib.html#pathlib.Path][Path]] is a standard Python object to represent paths - [[https://github.com/karlicoss/HPI/blob/5f4acfddeeeba18237e8b039c8f62bcaa62a4ac2/my/core/common.py#L9][PathIsh]] is a helper type to allow using either =str=, or a =Path= - [[https://github.com/karlicoss/HPI/blob/5f4acfddeeeba18237e8b039c8f62bcaa62a4ac2/my/core/common.py#L108][Paths]] is another helper type for paths. It's 'smart', allows you to be flexible about your config: - simple =str= or a =Path= - =/a/path/to/directory/=, so the module will consume all files from this directory - a list of files/directories (it will be flattened) - a [[https://docs.python.org/3/library/glob.html?highlight=glob#glob.glob][glob]] string, so you can be flexible about the format of your data on disk (e.g. if you want to keep it compressed) - empty string (e.g. ~export_path = ''~), this will prevent the module from consuming any data This can be useful for modules that merge multiple data sources (for example, =my.twitter= or =my.github=) Typically, such variable will be passed to =get_files= to actually extract the list of real files to use. You can see usage examples [[https://github.com/karlicoss/HPI/blob/master/tests/get_files.py][here]]. - if the field has a default value, you can omit it from your private config altogether For more thoughts on modules and their structure, see [[file:MODULE_DESIGN.org][MODULE_DESIGN]] * all.py Some modules have lots of different sources for data. For example, ~my.location~ (location data) has lots of possible sources -- from ~my.google.takeout.parser~, using the ~gpslogger~ android app, or through geolocating ~my.ip~ addresses. If you only plan on using one the modules, you can just import from the individual module, (e.g. ~my.google.takeout.parser~) or you can disable the others using the ~core~ config -- See the [[https://github.com/karlicoss/HPI/blob/master/doc/MODULE_DESIGN.org#allpy][MODULE_DESIGN]] docs for more details. * Configs The config snippets below are meant to be modified accordingly and *pasted into your private configuration*, e.g =$MY_CONFIG/my/config.py=. You don't have to set up all modules at once, it's recommended to do it gradually, to get the feel of how HPI works. For an extensive/complex example, you can check out ~@seanbreckenridge~'s [[https://github.com/seanbreckenridge/dotfiles/blob/master/.config/my/my/config/__init__.py][config]] # Nested Configurations before the doc generation using the block below ** [[file:../my/reddit][my.reddit]] Reddit data: saved items/comments/upvotes/etc. # Note: can't be generated as easily since this is a nested configuration object #+begin_src python class reddit: class rexport: ''' Uses [[https://github.com/karlicoss/rexport][rexport]] output. ''' # path[s]/glob to the exported JSON data export_path: Paths class pushshift: ''' Uses [[https://github.com/seanbreckenridge/pushshift_comment_export][pushshift]] to get access to old comments ''' # path[s]/glob to the exported JSON data export_path: Paths #+end_src ** [[file:../my/browser/][my.browser]] Parses browser history using [[http://github.com/seanbreckenridge/browserexport][browserexport]] #+begin_src python class browser: class export: # path[s]/glob to your backed up browser history sqlite files export_path: Paths class active_browser: # paths to sqlite database files which you use actively # to read from. For example: # from browserexport.browsers.all import Firefox # export_path = Firefox.locate_database() export_path: Paths #+end_src ** [[file:../my/location][my.location]] Merged location history from lots of sources. The main sources here are [[https://github.com/mendhak/gpslogger][gpslogger]] .gpx (XML) files, and google takeout (using =my.google.takeout.parser=), with a fallback on manually defined home locations. You might also be able to use [[file:../my/location/via_ip.py][my.location.via_ip]] which uses =my.ip.all= to provide geolocation data for an IPs (though no IPs are provided from any of the sources here). For an example of usage, see [[https://github.com/seanbreckenridge/HPI/tree/master/my/ip][here]] #+begin_src python class location: home = ( # supports ISO strings ('2005-12-04' , (42.697842, 23.325973)), # Bulgaria, Sofia # supports date/datetime objects (date(year=1980, month=2, day=15) , (40.7128 , -74.0060 )), # NY (datetime.fromtimestamp(1600000000, tz=timezone.utc), (55.7558 , 37.6173 )), # Moscow, Russia ) # note: order doesn't matter, will be sorted in the data provider class gpslogger: # path[s]/glob to the exported gpx files export_path: Paths # default accuracy for gpslogger accuracy: float = 50.0 class via_ip: # guess ~15km accuracy for IP addresses accuracy: float = 15_000 #+end_src ** [[file:../my/time/tz/via_location.py][my.time.tz.via_location]] Uses the =my.location= module to determine the timezone for a location. This can be used to 'localize' timezones. Most modules here return datetimes in UTC, to prevent confusion whether or not its a local timezone, one from UTC, or one in your timezone. Depending on the specific data provider and your level of paranoia you might expect different behaviour.. E.g.: - if your objects already have tz info, you might not need to call localize() at all - it's safer when either all of your objects are tz aware or all are tz unware, not a mixture - you might trust your original timezone, or it might just be UTC, and you want to use something more reasonable #+begin_src python TzPolicy = Literal[ 'keep' , # if datetime is tz aware, just preserve it 'convert', # if datetime is tz aware, convert to provider's tz 'throw' , # if datetime is tz aware, throw exception ] #+end_src This is still a work in progress, plan is to integrate it with =hpi query= so that you can easily convert/localize timezones for some module/data #+begin_src python class time: class tz: policy = 'keep' class via_location: # less precise, but faster fast: bool = True # sort locations by date # in case multiple sources provide them out of order sort_locations: bool = True # if the accuracy for the location is more than 5km (this # isn't an accurate location, so shouldn't use it to determine # timezone), don't use require_accuracy: float = 5_000 #+end_src # TODO hmm. drawer raw means it can output outlines, but then have to manually erase the generated results. ugh. #+begin_src python :dir .. :results output drawer raw :exports result # TODO ugh, pkgutil.walk_packages doesn't recurse and find packages like my.twitter.archive?? # yep.. https://stackoverflow.com/q/41203765/706389 import importlib # from lint import all_modules # meh # TODO figure out how to discover configs automatically... modules = [ ('google' , 'my.google.takeout.parser'), ('hypothesis' , 'my.hypothesis' ), ('pocket' , 'my.pocket' ), ('twint' , 'my.twitter.twint' ), ('twitter_archive', 'my.twitter.archive' ), ('lastfm' , 'my.lastfm' ), ('polar' , 'my.polar' ), ('instapaper' , 'my.instapaper' ), ('github' , 'my.github.gdpr' ), ('github' , 'my.github.ghexport' ), ('kobo' , 'my.kobo' ), ] def indent(s, spaces=4): return ''.join(' ' * spaces + l for l in s.splitlines(keepends=True)) from pathlib import Path import inspect from dataclasses import fields import re print('\n') # ugh. hack for org-ruby drawers bug for cls, p in modules: m = importlib.import_module(p) C = getattr(m, cls) src = inspect.getsource(C) i = src.find('@property') if i != -1: src = src[:i] src = src.strip() src = re.sub(r'(class \w+)\(.*', r'\1:', src) mpath = p.replace('.', '/') for x in ['.py', '__init__.py']: if Path(mpath + x).exists(): mpath = mpath + x print(f'** [[file:../{mpath}][{p}]]') mdoc = m.__doc__ if mdoc is not None: print(indent(mdoc)) print(f' #+begin_src python') print(indent(src)) print(f' #+end_src') #+end_src #+RESULTS: ** [[file:../my/google/takeout/parser.py][my.google.takeout.parser]] Parses Google Takeout using [[https://github.com/seanbreckenridge/google_takeout_parser][google_takeout_parser]] See [[https://github.com/seanbreckenridge/google_takeout_parser][google_takeout_parser]] for more information about how to export and organize your takeouts If the =DISABLE_TAKEOUT_CACHE= environment variable is set, this won't cache individual exports in =~/.cache/google_takeout_parser= The directory set as takeout_path can be unpacked directories, or zip files of the exports, which are temporarily unpacked while creating the cachew cache #+begin_src python class google(user_config): # directory which includes unpacked/zipped takeouts takeout_path: Paths error_policy: ErrorPolicy = 'yield' # experimental flag to use core.kompress.ZipPath # instead of unpacking to a tmp dir via match_structure _use_zippath: bool = False #+end_src ** [[file:../my/hypothesis.py][my.hypothesis]] [[https://hypothes.is][Hypothes.is]] highlights and annotations #+begin_src python class hypothesis: ''' Uses [[https://github.com/karlicoss/hypexport][hypexport]] outputs ''' # paths[s]/glob to the exported JSON data export_path: Paths #+end_src ** [[file:../my/pocket.py][my.pocket]] [[https://getpocket.com][Pocket]] bookmarks and highlights #+begin_src python class pocket: ''' Uses [[https://github.com/karlicoss/pockexport][pockexport]] outputs ''' # paths[s]/glob to the exported JSON data export_path: Paths #+end_src ** [[file:../my/twitter/twint.py][my.twitter.twint]] Twitter data (tweets and favorites). Uses [[https://github.com/twintproject/twint][Twint]] data export. Requirements: =pip3 install --user dataset= #+begin_src python class twint: export_path: Paths # path[s]/glob to the twint Sqlite database #+end_src ** [[file:../my/twitter/archive.py][my.twitter.archive]] Twitter data (uses [[https://help.twitter.com/en/managing-your-account/how-to-download-your-twitter-archive][official twitter archive export]]) #+begin_src python class twitter_archive: export_path: Paths # path[s]/glob to the twitter archive takeout #+end_src ** [[file:../my/lastfm][my.lastfm]] Last.fm scrobbles #+begin_src python class lastfm: """ Uses [[https://github.com/karlicoss/lastfm-backup][lastfm-backup]] outputs """ export_path: Paths #+end_src ** [[file:../my/polar.py][my.polar]] [[https://github.com/burtonator/polar-bookshelf][Polar]] articles and highlights #+begin_src python class polar: ''' Polar config is optional, you only need it if you want to specify custom 'polar_dir' ''' polar_dir: PathIsh = Path('~/.polar').expanduser() defensive: bool = True # pass False if you want it to fail faster on errors (useful for debugging) #+end_src ** [[file:../my/instapaper.py][my.instapaper]] [[https://www.instapaper.com][Instapaper]] bookmarks, highlights and annotations #+begin_src python class instapaper: ''' Uses [[https://github.com/karlicoss/instapexport][instapexport]] outputs. ''' # path[s]/glob to the exported JSON data export_path : Paths #+end_src ** [[file:../my/github/gdpr.py][my.github.gdpr]] Github data (uses [[https://github.com/settings/admin][official GDPR export]]) #+begin_src python class github: gdpr_dir: PathIsh # path to unpacked GDPR archive #+end_src ** [[file:../my/github/ghexport.py][my.github.ghexport]] Github data: events, comments, etc. (API data) #+begin_src python class github: ''' Uses [[https://github.com/karlicoss/ghexport][ghexport]] outputs. ''' # path[s]/glob to the exported JSON data export_path: Paths # path to a cache directory # if omitted, will use /tmp cache_dir: Optional[PathIsh] = None #+end_src ** [[file:../my/kobo.py][my.kobo]] [[https://uk.kobobooks.com/products/kobo-aura-one][Kobo]] e-ink reader: annotations and reading stats #+begin_src python class kobo: ''' Uses [[https://github.com/karlicoss/kobuddy#as-a-backup-tool][kobuddy]] outputs. ''' # path[s]/glob to the exported databases export_path: Paths #+end_src