
This file is an overview of *documented* modules (which I'm progressively expanding).

There are many more, see:

- [[file:../README.org::#whats-inside]["What's inside"]] for the full list of modules.
- you can also run =hpi modules= to list what's available on your system
- [[https://github.com/karlicoss/HPI][source code]] is always the primary source of truth

If you have some issues with the setup, see [[file:SETUP.org::#troubleshooting]["Troubleshooting"]].

* TOC
:PROPERTIES:
:TOC:      :include all
:END:
:CONTENTS:
- [[#toc][TOC]]
- [[#intro][Intro]]
- [[#allpy][all.py]]
- [[#configs][Configs]]
  - [[#myreddit][my.reddit]]
  - [[#mybrowser][my.browser]]
  - [[#mylocation][my.location]]
  - [[#mytimetzvia_location][my.time.tz.via_location]]
  - [[#mygoogletakeoutparser][my.google.takeout.parser]]
  - [[#myhypothesis][my.hypothesis]]
  - [[#mypocket][my.pocket]]
  - [[#mytwittertwint][my.twitter.twint]]
  - [[#mytwitterarchive][my.twitter.archive]]
  - [[#mylastfm][my.lastfm]]
  - [[#mypolar][my.polar]]
  - [[#myinstapaper][my.instapaper]]
  - [[#mygithubgdpr][my.github.gdpr]]
  - [[#mygithubghexport][my.github.ghexport]]
  - [[#mykobo][my.kobo]]
:END:

* Intro

See [[file:SETUP.org][SETUP]] to find out how to set up your own config.

Some explanations:

- =MY_CONFIG= is the path where you are keeping your private configuration (usually =~/.config/my/=)
- [[https://docs.python.org/3/library/pathlib.html#pathlib.Path][Path]] is a standard Python object to represent paths
- [[https://github.com/karlicoss/HPI/blob/5f4acfddeeeba18237e8b039c8f62bcaa62a4ac2/my/core/common.py#L9][PathIsh]] is a helper type to allow using either =str=, or a =Path=
- [[https://github.com/karlicoss/HPI/blob/5f4acfddeeeba18237e8b039c8f62bcaa62a4ac2/my/core/common.py#L108][Paths]] is another helper type for paths.

  It's 'smart' and lets you be flexible about your config:

  - simple =str= or a =Path=
  - =/a/path/to/directory/=, so the module will consume all files from this directory
  - a list of files/directories (it will be flattened)
  - a [[https://docs.python.org/3/library/glob.html?highlight=glob#glob.glob][glob]] string, so you can be flexible about the format of your data on disk (e.g. if you want to keep it compressed)
  - empty string (e.g. ~export_path = ''~), this will prevent the module from consuming any data

    This can be useful for modules that merge multiple data sources (for example, =my.twitter= or =my.github=)

  Typically, such a variable is passed to =get_files= to extract the actual list of files to use. You can see usage examples [[https://github.com/karlicoss/HPI/blob/master/tests/get_files.py][here]].

- if the field has a default value, you can omit it from your private config altogether
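
As a rough illustration of the =Paths= forms listed above, here is a simplified, hypothetical stand-in for =get_files=. The name =resolve_paths= and its exact behaviour are assumptions made for this sketch, not HPI's actual implementation; it only shows how each accepted form could expand into a list of files:

#+begin_src python
from glob import glob
from pathlib import Path
from typing import List, Sequence, Union

PathIsh = Union[str, Path]
Paths = Union[PathIsh, Sequence[PathIsh]]

def resolve_paths(pp: Paths) -> List[Path]:
    """Hypothetical helper: expand a Paths value into a sorted list of paths."""
    if isinstance(pp, str) and pp == '':
        # empty string: consume no data at all
        return []
    # a single str/Path is treated as a one-element list
    items = [pp] if isinstance(pp, (str, Path)) else list(pp)
    found: List[Path] = []
    for item in items:
        s = str(item)
        if any(ch in s for ch in '*?['):
            # glob pattern: keep whatever matches on disk
            found.extend(Path(g) for g in glob(s))
        elif Path(s).is_dir():
            # directory: take every file directly inside it
            found.extend(p for p in Path(s).iterdir() if p.is_file())
        else:
            # plain file path: keep as is
            found.append(Path(s))
    return sorted(found)
#+end_src

For example, =resolve_paths('/some/dir/*.json')= would return only the matching JSON files, while =resolve_paths('')= returns an empty list.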

For more thoughts on modules and their structure, see [[file:MODULE_DESIGN.org][MODULE_DESIGN]]

* all.py

Some modules have lots of different sources for data. For example,
~my.location~ (location data) has lots of possible sources: from
~my.google.takeout.parser~, from the ~gpslogger~ Android app, or from
geolocating ~my.ip~ addresses. If you only plan on using one of the modules, you
can just import from the individual module (e.g. ~my.google.takeout.parser~),
or you can disable the others using the ~core~ config. See the
[[https://github.com/karlicoss/HPI/blob/master/doc/MODULE_DESIGN.org#allpy][MODULE_DESIGN]] docs for more details.
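
For instance, disabling sources through the ~core~ config might look like the sketch below. It assumes the =disabled_modules= field described in MODULE_DESIGN; the module names are only illustrative, so adjust them to the sources you don't use:

#+begin_src python
# goes into your private config, next to the other module classes
class core:
    # assumption: modules listed here are skipped when sources are merged
    disabled_modules = [
        'my.location.gpslogger',
        'my.location.via_ip',
    ]
#+end_src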

* Configs

The config snippets below are meant to be modified accordingly and *pasted into your private configuration*, e.g. =$MY_CONFIG/my/config.py=.

You don't have to set up all modules at once; it's recommended to do it gradually, to get a feel for how HPI works.

For an extensive/complex example, you can check out ~@seanbreckenridge~'s [[https://github.com/seanbreckenridge/dotfiles/blob/master/.config/my/my/config/__init__.py][config]]
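
For a small starting point, a filled-in config might look roughly like the sketch below. The paths are hypothetical; the class and field names are taken from the snippets in this document:

#+begin_src python
# hypothetical contents of $MY_CONFIG/my/config.py

class hypothesis:
    # a glob, so every exported JSON file is picked up
    export_path = '/data/exports/hypothesis/*.json'

class reddit:
    # nested configuration: one inner class per data source
    class rexport:
        # a directory, so all files inside it are used
        export_path = '/data/exports/reddit/'
#+end_src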

# Nested Configurations before the doc generation using the block below
** [[file:../my/reddit][my.reddit]]

    Reddit data: saved items/comments/upvotes/etc.

    # Note: can't be generated as easily since this is a nested configuration object
    #+begin_src python
    class reddit:
        class rexport:
            '''
            Uses [[https://github.com/karlicoss/rexport][rexport]] output.
            '''

            # path[s]/glob to the exported JSON data
            export_path: Paths

        class pushshift:
            '''
            Uses [[https://github.com/seanbreckenridge/pushshift_comment_export][pushshift]] to get access to old comments
            '''

            # path[s]/glob to the exported JSON data
            export_path: Paths

    #+end_src

** [[file:../my/browser/][my.browser]]

    Parses browser history using [[http://github.com/seanbreckenridge/browserexport][browserexport]]

    #+begin_src python
    class browser:
        class export:
            # path[s]/glob to your backed up browser history sqlite files
            export_path: Paths

        class active_browser:
            # paths to sqlite database files which you use actively
            # to read from. For example:
            # from browserexport.browsers.all import Firefox
            # export_path = Firefox.locate_database()
            export_path: Paths
    #+end_src
** [[file:../my/location][my.location]]

   Merged location history from lots of sources.

   The main sources here are
   [[https://github.com/mendhak/gpslogger][gpslogger]] .gpx (XML) files, and
   google takeout (using =my.google.takeout.parser=), with a fallback on
   manually defined home locations.

   You might also be able to use [[file:../my/location/via_ip.py][my.location.via_ip]], which uses =my.ip.all= to
   provide geolocation data for IPs (though no IPs are provided from any
   of the sources here). For an example of usage, see [[https://github.com/seanbreckenridge/HPI/tree/master/my/ip][here]].

    #+begin_src python
    class location:
        home = (
            # supports ISO strings
            ('2005-12-04'                                       , (42.697842, 23.325973)), # Bulgaria, Sofia
            # supports date/datetime objects
            (date(year=1980, month=2, day=15)                   , (40.7128  , -74.0060 )), # NY
            (datetime.fromtimestamp(1600000000, tz=timezone.utc), (55.7558  , 37.6173  )), # Moscow, Russia
        )
        # note: order doesn't matter, will be sorted in the data provider

        class gpslogger:
            # path[s]/glob to the exported gpx files
            export_path: Paths

            # default accuracy for gpslogger
            accuracy: float = 50.0

        class via_ip:
            # guess ~15km accuracy for IP addresses
            accuracy: float = 15_000
    #+end_src
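
   To illustrate why the order of =home= entries doesn't matter, here is a minimal sketch of how such mixed entries could be normalized and sorted. The helper names (=to_datetime=, =sorted_homes=) are hypothetical, and treating naive values as UTC is an assumption of this sketch rather than a documented guarantee:

    #+begin_src python
    from datetime import date, datetime, time, timezone
    from typing import List, Sequence, Tuple, Union

    DateIsh = Union[str, date, datetime]
    LatLon = Tuple[float, float]

    def to_datetime(d: DateIsh) -> datetime:
        """Hypothetical helper: turn an ISO string/date/datetime into an aware datetime."""
        if isinstance(d, str):
            d = datetime.fromisoformat(d)
        if not isinstance(d, datetime):
            # plain date: use midnight of that day
            d = datetime.combine(d, time.min)
        if d.tzinfo is None:
            # assumption: naive values are treated as UTC so they can be compared
            d = d.replace(tzinfo=timezone.utc)
        return d

    def sorted_homes(home: Sequence[Tuple[DateIsh, LatLon]]) -> List[Tuple[datetime, LatLon]]:
        # normalize every entry first, then sort chronologically
        return sorted((to_datetime(d), loc) for d, loc in home)
    #+end_src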
** [[file:../my/time/tz/via_location.py][my.time.tz.via_location]]

   Uses the =my.location= module to determine the timezone for a location.

   This can be used to 'localize' timezones. Most modules here return
   datetimes in UTC, to avoid confusion over whether a datetime is in a local
   timezone, in UTC, or in your own timezone.

   Depending on the specific data provider and your level of paranoia, you might expect different behaviour. E.g.:
    - if your objects already have tz info, you might not need to call localize() at all
    - it's safer when either all of your objects are tz aware or all are tz unaware, not a mixture
    - you might trust your original timezone, or it might just be UTC, and you want to use something more reasonable

   #+begin_src python
   TzPolicy = Literal[
       'keep'   , # if datetime is tz aware, just preserve it
       'convert', # if datetime is tz aware, convert to provider's tz
       'throw'  , # if datetime is tz aware, throw exception
   ]
   #+end_src
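
   A minimal sketch of what the three policies could mean in practice. The function below is hypothetical (including its name, its signature and its handling of naive datetimes) and uses a fixed-offset timezone to stay self-contained; it is not the module's actual code:

   #+begin_src python
   from datetime import datetime, timedelta, timezone, tzinfo
   from typing import Literal

   TzPolicy = Literal['keep', 'convert', 'throw']

   def localize(dt: datetime, provider_tz: tzinfo, policy: TzPolicy = 'keep') -> datetime:
       """Hypothetical helper applying a TzPolicy to a single datetime."""
       if dt.tzinfo is None:
           # assumption: naive datetimes simply get the provider's timezone attached
           return dt.replace(tzinfo=provider_tz)
       if policy == 'keep':
           return dt  # already tz aware: preserve as is
       if policy == 'convert':
           return dt.astimezone(provider_tz)  # same instant, provider's timezone
       if policy == 'throw':
           raise RuntimeError(f'{dt} already has tz info')
       raise ValueError(f'unknown policy: {policy}')
   #+end_src

   With =policy='convert'=, a UTC datetime at 12:00 and a provider timezone of UTC+2 would come back as 14:00 in that timezone, still denoting the same instant.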

   This is still a work in progress; the plan is to integrate it with =hpi query=
   so that you can easily convert/localize timezones for some module/data.

   #+begin_src python
   class time:
       class tz:
           policy = 'keep'

           class via_location:
               # less precise, but faster
               fast: bool = True

               # skip locations whose accuracy is worse than 5km: they are
               # too imprecise to determine the timezone from
               require_accuracy: float = 5_000
    #+end_src
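
   As an illustration of =require_accuracy=, the sketch below filters out locations that are too imprecise. The =Location= record and the =accurate_enough= helper are hypothetical, and keeping locations that report no accuracy at all is an assumption of this sketch:

   #+begin_src python
   from dataclasses import dataclass
   from typing import Iterable, Iterator, Optional

   @dataclass
   class Location:
       """Hypothetical minimal location record."""
       lat: float
       lon: float
       accuracy: Optional[float] = None  # in metres; None if the source doesn't report it

   def accurate_enough(locs: Iterable[Location], require_accuracy: float = 5_000) -> Iterator[Location]:
       for loc in locs:
           if loc.accuracy is not None and loc.accuracy > require_accuracy:
               continue  # too imprecise to pick a timezone from
           yield loc
   #+end_src

   With the defaults shown in this document, a =gpslogger= point (50m) would pass the filter, while an IP-based guess (15km) would be skipped.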


# TODO hmm. drawer raw means it can output outlines, but then have to manually erase the generated results. ugh.

#+begin_src python :dir .. :results output drawer raw :exports result
# TODO ugh, pkgutil.walk_packages doesn't recurse and find packages like my.twitter.archive??
# yep.. https://stackoverflow.com/q/41203765/706389
import importlib
# from lint import all_modules # meh
# TODO figure out how to discover configs automatically...
modules = [
    ('google'         , 'my.google.takeout.parser'),
    ('hypothesis'     , 'my.hypothesis'           ),
    ('pocket'         , 'my.pocket'               ),
    ('twint'          , 'my.twitter.twint'        ),
    ('twitter_archive', 'my.twitter.archive'      ),
    ('lastfm'         , 'my.lastfm'               ),
    ('polar'          , 'my.polar'                ),
    ('instapaper'     , 'my.instapaper'           ),
    ('github'         , 'my.github.gdpr'          ),
    ('github'         , 'my.github.ghexport'      ),
    ('kobo'           , 'my.kobo'                 ),
]

def indent(s, spaces=4):
    return ''.join(' ' * spaces + l for l in s.splitlines(keepends=True))

from pathlib import Path
import inspect
from dataclasses import fields
import re
print('\n') # ugh. hack for org-ruby drawers bug
for cls, p in modules:
    m = importlib.import_module(p)
    C = getattr(m, cls)
    src = inspect.getsource(C)
    i = src.find('@property')
    if i != -1:
        src = src[:i]
    src = src.strip()
    src = re.sub(r'(class \w+)\(.*', r'\1:', src)
    mpath = p.replace('.', '/')
    for x in ['.py', '__init__.py']:
        if Path(mpath + x).exists():
            mpath = mpath + x
    print(f'** [[file:../{mpath}][{p}]]')
    mdoc = m.__doc__
    if mdoc is not None:
        print(indent(mdoc))
    print(f'    #+begin_src python')
    print(indent(src))
    print(f'    #+end_src')
#+end_src

#+RESULTS:

** [[file:../my/google/takeout/parser.py][my.google.takeout.parser]]

      Parses Google Takeout using [[https://github.com/seanbreckenridge/google_takeout_parser][google_takeout_parser]]

      See [[https://github.com/seanbreckenridge/google_takeout_parser][google_takeout_parser]] for more information about how to export and organize your takeouts

      If the =DISABLE_TAKEOUT_CACHE= environment variable is set, this won't
      cache individual exports in =~/.cache/google_takeout_parser=

      The directory set as takeout_path can be unpacked directories, or
      zip files of the exports, which are temporarily unpacked while creating
      the cachew cache

    #+begin_src python
    class google(user_config):
        # directory which includes unpacked/zipped takeouts
        takeout_path: Paths

        error_policy: ErrorPolicy = 'yield'

        # experimental flag to use core.kompress.ZipPath
        # instead of unpacking to a tmp dir via match_structure
        _use_zippath: bool = False
    #+end_src
** [[file:../my/hypothesis.py][my.hypothesis]]

    [[https://hypothes.is][Hypothes.is]] highlights and annotations

    #+begin_src python
    class hypothesis:
        '''
        Uses [[https://github.com/karlicoss/hypexport][hypexport]] outputs
        '''

        # path[s]/glob to the exported JSON data
        export_path: Paths
    #+end_src
** [[file:../my/pocket.py][my.pocket]]

    [[https://getpocket.com][Pocket]] bookmarks and highlights

    #+begin_src python
    class pocket:
        '''
        Uses [[https://github.com/karlicoss/pockexport][pockexport]] outputs
        '''

        # path[s]/glob to the exported JSON data
        export_path: Paths
    #+end_src
** [[file:../my/twitter/twint.py][my.twitter.twint]]

    Twitter data (tweets and favorites).

    Uses [[https://github.com/twintproject/twint][Twint]] data export.

    Requirements: =pip3 install --user dataset=

    #+begin_src python
    class twint:
        export_path: Paths # path[s]/glob to the twint SQLite database
    #+end_src
** [[file:../my/twitter/archive.py][my.twitter.archive]]

    Twitter data (uses [[https://help.twitter.com/en/managing-your-account/how-to-download-your-twitter-archive][official twitter archive export]])

    #+begin_src python
    class twitter_archive:
        export_path: Paths # path[s]/glob to the twitter archive takeout
    #+end_src
** [[file:../my/lastfm][my.lastfm]]

    Last.fm scrobbles

    #+begin_src python
    class lastfm:
        """
        Uses [[https://github.com/karlicoss/lastfm-backup][lastfm-backup]] outputs
        """
        export_path: Paths
    #+end_src
** [[file:../my/polar.py][my.polar]]

    [[https://github.com/burtonator/polar-bookshelf][Polar]] articles and highlights

    #+begin_src python
    class polar:
        '''
        Polar config is optional, you only need it if you want to specify custom 'polar_dir'
        '''
        polar_dir: PathIsh = Path('~/.polar').expanduser()
        defensive: bool = True # pass False if you want it to fail faster on errors (useful for debugging)
    #+end_src
** [[file:../my/instapaper.py][my.instapaper]]

    [[https://www.instapaper.com][Instapaper]] bookmarks, highlights and annotations

    #+begin_src python
    class instapaper:
        '''
        Uses [[https://github.com/karlicoss/instapexport][instapexport]] outputs.
        '''
        # path[s]/glob to the exported JSON data
        export_path: Paths
    #+end_src
** [[file:../my/github/gdpr.py][my.github.gdpr]]

    GitHub data (uses [[https://github.com/settings/admin][official GDPR export]])

    #+begin_src python
    class github:
        gdpr_dir: PathIsh  # path to unpacked GDPR archive
    #+end_src
** [[file:../my/github/ghexport.py][my.github.ghexport]]

    GitHub data: events, comments, etc. (API data)

    #+begin_src python
    class github:
        '''
        Uses [[https://github.com/karlicoss/ghexport][ghexport]] outputs.
        '''
        # path[s]/glob to the exported JSON data
        export_path: Paths

        # path to a cache directory
        # if omitted, will use /tmp
        cache_dir: Optional[PathIsh] = None
    #+end_src
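
    Both =my.github.gdpr= and =my.github.ghexport= read their settings from a class named =github=, so a single class in your private config can carry the fields for both. A sketch with hypothetical paths:

    #+begin_src python
    class github:
        # used by my.github.gdpr: path to the unpacked GDPR archive
        gdpr_dir = '/data/exports/github-gdpr/'

        # used by my.github.ghexport: path[s]/glob to the exported JSON data
        export_path = '/data/exports/ghexport/*.json'
    #+end_src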
** [[file:../my/kobo.py][my.kobo]]

    [[https://uk.kobobooks.com/products/kobo-aura-one][Kobo]] e-ink reader: annotations and reading stats

    #+begin_src python
    class kobo:
        '''
        Uses [[https://github.com/karlicoss/kobuddy#as-a-backup-tool][kobuddy]] outputs.
        '''
        # path[s]/glob to the exported databases
        export_path: Paths
    #+end_src