Some thoughts on modules, how to structure them, and adding your own/extending HPI

This is slightly more advanced, and would be useful if you're trying to extend HPI by developing your own modules, or contributing back to HPI

* TOC
:PROPERTIES:
:TOC:      :include all :depth 1 :force (nothing) :ignore (this) :local (nothing)
:END:
:CONTENTS:
- [[#allpy][all.py]]
- [[#module-count][module count]]
- [[#single-file-modules][single file modules]]
- [[#adding-new-modules][Adding new modules]]
- [[#an-extendable-module-structure][An Extendable module structure]]
- [[#logging-guidelines][Logging guidelines]]
:END:

* all.py

Some modules have lots of different sources for data. For example, ~my.location~ (location data) has lots of possible sources -- from ~my.google.takeout.parser~, using the ~gpslogger~ android app, or through geolocating ~my.ip~ addresses.

For a module with multiple possible sources, it's common to split it into files like:

#+begin_src
my/location
├── all.py -- specifies all possible sources/combines/merges data
├── common.py -- defines shared code, e.g. to merge data from across entries, a shared model (namedtuple/dataclass) or protocol
├── google_takeout.py -- source for data using my.google.takeout.parser
├── gpslogger.py -- source for data using gpslogger
├── home.py -- fallback source
└── via_ip.py -- source using my.ip
#+end_src

It's common for each of those sources to have their own file, like ~my.location.google_takeout~, ~my.location.gpslogger~ and ~my.location.via_ip~, and then they all get merged into a single function in ~my.location.all~, like:

#+begin_src python
from typing import Iterator

from my.core.source import import_source

from .common import Location


def locations() -> Iterator[Location]:
    # can add/comment out sources here to enable/disable them
    yield from _takeout_locations()
    yield from _gpslogger_locations()


@import_source(module_name="my.location.google_takeout")
def _takeout_locations() -> Iterator[Location]:
    from . import google_takeout
    yield from google_takeout.locations()


@import_source(module_name="my.location.gpslogger")
def _gpslogger_locations() -> Iterator[Location]:
    from . import gpslogger
    yield from gpslogger.locations()
#+end_src

If you want to disable a source, you have a few options:

- If you're using a local editable install or just want to quickly troubleshoot, you can simply comment out the corresponding line in the ~locations~ function
- Since these are decorated with ~import_source~, they automatically catch import/config errors, so instead of fatally erroring and crashing if you don't have a module set up, it warns you and continues to process the other sources. To get rid of the warnings, you can add the modules you're not planning on using to your core config, like:

  #+begin_src python
  class core:
      disabled_modules = (
          "my.location.gpslogger",
          "my.location.via_ip",
      )
  #+end_src

  ... that suppresses the warning message and lets you use ~my.location.all~ without having to change any lines of code

Another benefit is that all the custom sources/data are localized to the ~all.py~ file, so a user can override the ~all.py~ file (see the sections below on ~namespace packages~) in their own HPI repository, adding additional sources without having to maintain a fork and patch in changes as things eventually change. For a 'real world' example of that, see [[https://github.com/purarue/HPI#partially-in-usewith-overrides][purarue]]'s location and ip modules.
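For instance, an overridden ~my/location/all.py~ in your own repository could keep an upstream source and add one of your own on top. The following is only a sketch: ~my.location.my_tracker~ is a made-up source module used for illustration, and ~import_source~ is imported from ~my.core.source~ as in the snippet above:

#+begin_src python
# your own my/location/all.py, living in your overlay repository
from typing import Iterator

from my.core.source import import_source

from .common import Location  # resolves to the upstream common.py via the namespace package


def locations() -> Iterator[Location]:
    yield from _takeout_locations()     # keep an upstream source...
    yield from _my_tracker_locations()  # ...and add your own on top


@import_source(module_name="my.location.google_takeout")
def _takeout_locations() -> Iterator[Location]:
    from . import google_takeout
    yield from google_takeout.locations()


# 'my.location.my_tracker' is a hypothetical source module in your own repository
@import_source(module_name="my.location.my_tracker")
def _my_tracker_locations() -> Iterator[Location]:
    from . import my_tracker
    yield from my_tracker.locations()
#+end_src

Since the extra source also sits behind ~import_source~, anyone running this ~all.py~ without that module configured just gets a warning instead of a crash.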
This is of course not required for personal or single file modules, it's just the pattern that seems to have the least amount of friction for the user, while being extendable, and without using a bulky plugin system to let users add additional sources.

Another common way an ~all.py~ file is used is to merge data from a periodic export and a GDPR export (e.g. see the ~stackexchange~ or ~github~ modules)

* module count

Having way too many modules could end up being an issue. For now, I'm basically happy to merge new modules -- with the current module count, things don't seem to break much, and most of them are modules I use myself, so they get tested with my own data. For services I don't use, I would prefer if they had tests/example data somewhere, else I can't guarantee they're still working...

It's great if, when you start using HPI, you get a few modules 'for free' (perhaps ~github~ and ~reddit~), but it's likely not everyone uses the same services

This shouldn't end up becoming a monorepo (a la [[https://www.spacemacs.org/][Spacemacs]]) with hundreds of modules supporting every use case. It's hard to know what the common use case is for everyone, and new services/companies which silo your data appear all the time...

It's also not obvious how people want to access their data. This problem is often mitigated by the output of HPI being python functions -- one can always write a small script to take the output data from a module and wrangle it into whatever format you want

This is why HPI aims to be as extendable as possible. If you have some programming know-how, hopefully you're able to create some basic modules for yourself -- plug in your own data and gain the benefits of using the functions in ~my.core~, the configuration layer and possibly libraries like [[https://github.com/karlicoss/cachew][cachew]] to 'automatically' cache your data

In some ways it may make sense to think of HPI as akin to emacs or one's 'dotfiles'. It provides a configuration layer and structure for you to access your data, and you can extend it to your own use case.

* single file modules

... or, the question 'should we split code from individual HPI files into setuptools packages'

It's possible for a single HPI module or file to handle *everything*. Most of the python files in ~my/~ are 'single file' modules

By everything, I mean:

- exporting data from an API/locating data on your disk/maybe saving data so you don't lose it
- parsing data from some raw (JSON/SQLite/HTML) format
- merging different data sources into some common =NamedTuple=-like schema
- caching expensive computation/merge results
- configuration through ~my.config~

For short modules which aren't that complex, while developing your own personal modules, or while bootstrapping new modules -- this is actually fine.

From a user's perspective, the ability to clone and install HPI as editable, add a new python file into ~my/~, and have it immediately accessible as ~my.modulename~ is a pattern that should always be supported

However, as modules get more and more complex, especially if they include backing up/locating data from some location on your filesystem or interacting with a live API -- ideally they should be split off into their own repositories. There are trade-offs to doing this, but they are typically worth it.
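To give a feel for what the HPI side of such a split looks like, here's a rough sketch. The names are illustrative rather than real: ~someexport~ stands in for the separate exporter/parser package, and the ~somedata~ config block with an ~export_path~ attribute is made up; ~get_files~ is the ~my.core~ helper used to resolve configured paths:

#+begin_src python
# hypothetical my/somedata.py -- a thin wrapper around a standalone exporter package
from my.core import get_files

from my.config import somedata as user_config  # the user's config block (made-up name)

# the separate repository owns exporting the data and loading/parsing the raw format
import someexport.dal as dal


def events():
    # HPI's job here is just to locate the configured export files and hand them to the DAL
    yield from dal.DAL(get_files(user_config.export_path)).events()
#+end_src

The heavy lifting (talking to the API, parsing the raw format) lives in the separate package, so it can be installed, tested and reused without HPI.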
As a real example of such a split, take a look at the [[https://github.com/karlicoss/HPI/tree/5ef277526577daaa115223e79a07a064ffa9bc85/my/github][my.github]] module and the corresponding [[https://github.com/karlicoss/ghexport][ghexport]] data exporter, which saves github data.

- Pros:
  - This allows someone to install and use ~ghexport~ without having to set up HPI at all -- it's a standalone tool, which means there's less barrier to entry
  - It being a separate repository means issues relating to exporting data and the [[https://beepb00p.xyz/exports.html#dal][DAL]] (loading the data) can be handled there, instead of in HPI
  - This reduces complexity for someone looking at the ~my.github~ files trying to debug issues related to HPI. The functionality of ~ghexport~ can be tested independently of someone new to HPI trying to debug a configuration issue
  - It's easier to combine additional data sources, like ~my.github.gdpr~, which includes additional data from the GDPR export
- Cons:
  - Leads to some code duplication, as you can no longer use helper functions from ~my.core~ in the new repository
  - Additional boilerplate -- instructions, installation scripts, testing. It's not required, but typically you want to leverage ~setuptools~ to allow ~pip install git+https...~ type installs, which are used in ~hpi module install~
  - It's difficult to convert to a namespace module/directory down the road

Not all HPI modules are currently at that level of complexity -- some are simple enough that one can understand the file by just reading it top to bottom. Some wouldn't make sense to split off into separate modules for one reason or another.

A related concern is how to structure namespace packages to allow users to easily extend them, and how this conflicts with single file modules (keep reading below for more information on namespace packages/extension)

Converting a module from a single file into a namespace package with multiple files is a breaking change, see [[https://github.com/karlicoss/HPI/issues/89][#89]] for an example of this. The current workaround is to leave it a regular python package with an =__init__.py= for some amount of time, emit a deprecation warning, and then eventually remove the =__init__.py= file to convert it into a namespace package. For an example, see the [[https://github.com/karlicoss/HPI/blob/8422c6e420f5e274bd1da91710663be6429c666c/my/reddit/__init__.py][reddit init file]].

It's quite a pain to have to convert a file from a single file module to a namespace module, so if there's *any* possibility that you might convert it to a namespace package, you might as well just start it off as one, to avoid the pain down the road.

As an example, say you were creating something to parse ~zsh~ history. Instead of creating ~my/zsh.py~, it would be better to create ~my/zsh/parser.py~. That lets users override the file using editable/namespace packages, and it also means it's much easier to extend it in the future to something like:

#+begin_src
my/zsh
├── all.py -- e.g. combined/unique/sorted zsh history
├── aliases.py -- parse zsh alias files
├── common.py -- shared models/merging code
├── compdump.py -- parse zsh compdump files
└── parser.py -- parse individual zsh history files
#+end_src

There's no requirement to follow this entire structure when you start off; the entire module could live in ~my/zsh/parser.py~, including all the merging/parsing/locating code. It just avoids the trouble in the future, and the only downside is having to type a bit more when importing from it.
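To make that concrete, a first version of such a hypothetical ~my/zsh/parser.py~ might look something like the sketch below. The ~zsh~ config block and its ~export_path~ attribute are made up for illustration; ~get_files~ comes from ~my.core~, and the parsing assumes zsh's =EXTENDED_HISTORY= format:

#+begin_src python
# hypothetical my/zsh/parser.py -- parses individual zsh history files
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterator, NamedTuple

from my.core import get_files

from my.config import zsh as user_config  # made-up config block with an 'export_path' attribute


class Entry(NamedTuple):
    dt: datetime
    command: str


def inputs() -> list[Path]:
    # locate the history files the user has configured
    return list(get_files(user_config.export_path))


def history() -> Iterator[Entry]:
    for path in inputs():
        yield from _parse_file(path)


def _parse_file(path: Path) -> Iterator[Entry]:
    # EXTENDED_HISTORY lines look like ': 1612345678:0;some command'
    for line in path.read_text(errors='replace').splitlines():
        if not line.startswith(': '):
            continue
        meta, _, command = line[2:].partition(';')
        timestamp = meta.split(':', maxsplit=1)[0]
        yield Entry(dt=datetime.fromtimestamp(int(timestamp), tz=timezone.utc), command=command)
#+end_src

If this later grows into the ~my/zsh/all.py~ + ~aliases.py~ + ~compdump.py~ layout above, ~parser.py~ stays where it is and imports of ~my.zsh.parser~ keep working.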
* Adding new modules

As always, if the changes you wish to make are small, or you just want to add a few modules, you can clone and edit an editable install of HPI. See [[file:SETUP.org][SETUP]] for more information

The "proper way" (unless you want to contribute to the upstream) is to create a separate file hierarchy and add your module to =PYTHONPATH= (or use 'editable namespace packages' as described below, which also modifies your computed ~sys.path~)

# TODO link to 'overlays' documentation?
You can check my own [[https://github.com/karlicoss/hpi-personal-overlay][personal overlay]] as a reference.

For example, if you want to add an =awesomedatasource=, it could be:

: custom_module
: └── my
:     └── awesomedatasource.py

You can use all existing HPI modules in =awesomedatasource.py=, including =my.config= and everything from =my.core=. The =hpi modules= and =hpi doctor= commands should also detect your extra modules.

- In addition, you can *override* the builtin HPI modules too:

  : custom_lastfm_overlay
  : └── my
  :     └── lastfm.py

  Now if you add =custom_lastfm_overlay= [[https://docs.python.org/3/using/cmdline.html#envvar-PYTHONPATH][*in front* of ~PYTHONPATH~]], all the downstream scripts using =my.lastfm= will load it from =custom_lastfm_overlay= instead.

  This could be useful to monkey patch some behaviours, or dynamically add some extra data sources -- anything that comes to your mind. You can check [[https://github.com/karlicoss/hpi-personal-overlay/blob/7fca8b1b6031bf418078da2d8be70fd81d2d8fa0/src/my/calendar/holidays.py#L1-L14][my.calendar.holidays]] in my personal overlay as a reference.

** Namespace Packages

Note: this section covers some of the complexities and benefits of HPI being a namespace package and/or an editable install, so it assumes some familiarity with python/imports

HPI is installed as a namespace package, which allows an additional way to add your own modules. For the details on namespace packages, see [[https://www.python.org/dev/peps/pep-0420/][PEP420]], or the [[https://packaging.python.org/guides/packaging-namespace-packages][packaging docs for a summary]], but for our use case, a sufficient description might be: namespace packages let you split a package across multiple directories on disk.

Rather than adding a bulky/boilerplate-y plugin framework to HPI, which would increase the barrier to entry, [[https://packaging.python.org/guides/creating-and-discovering-plugins/#using-namespace-packages][namespace packages offer an alternative]] with few downsides.

Creating a separate file hierarchy still allows you to keep up to date with any changes from this repository by running ~git pull~ on your local clone of HPI periodically (assuming you've installed it as an editable package, ~pip install -e .~), while creating your own modules, and possibly overwriting any files you wish to override/overlay.

In order to do that, as stated above, you could edit the ~PYTHONPATH~ variable, which in turn modifies your computed ~sys.path~, which is how python [[https://docs.python.org/3/library/sys.html?highlight=pythonpath#sys.path][determines the search path for modules]]. This is sort of what [[file:../with_my][with_my]] allows you to do.

In the context of HPI, being a namespace package means you can have a local clone of this repository, and your own 'HPI' modules in a separate folder, which then get combined into the ~my~ package.

As an example, say you were trying to override the ~my.lastfm~ file, to include some new feature. You could create a new file hierarchy like:

: .
: ├── my
: │   ├── lastfm.py
: │   └── some_new_module.py
: └── setup.py

Where ~lastfm.py~ is your version of ~my.lastfm~, which you've copied from this repository and applied your changes to. The ~setup.py~ would be something like:

#+begin_src python
from setuptools import setup, find_namespace_packages

# should use a different name,
# so it's possible to differentiate between HPI installs
setup(
    name="my-HPI-overlay",
    zip_safe=False,
    packages=find_namespace_packages(".", include=["my*"]),
)
#+end_src

Then, running ~python3 -m pip install -e .~ in that directory would install that as part of the namespace package, and assuming (see below for possible issues) this appears on ~sys.path~ before the upstream repository, your ~lastfm.py~ file overrides the upstream one. Adding more files, like ~my.some_new_module~, into that directory immediately updates the global ~my~ package -- allowing you to quickly add new modules without having to re-install.

If you install both directories as editable packages (which has the benefit of any changes you make in either repository immediately updating the globally installed ~my~ package), there are some concerns about which editable install appears on your ~sys.path~ first. If you want your modules to override the upstream modules, yours have to appear on the ~sys.path~ first (this is the same reason that =custom_lastfm_overlay= must be at the front of your ~PYTHONPATH~). For more details and examples on dealing with editable namespace packages in the context of HPI, see the [[https://github.com/purarue/reorder_editable][reorder_editable]] repository.

There is no limit to how many directories you could install into a single namespace package, which could be a possible way for people to install additional HPI modules, without worrying about the module count here becoming too large to manage.

There are some other users [[https://github.com/hpi/hpi][who have begun publishing their own modules]] as namespace packages, which you could potentially install and use, in addition to this repository, if any of those interest you. If you want to create your own, you can use the [[https://github.com/purarue/HPI-template][template]] to get started.

Though, enabling this many modules may make ~hpi doctor~ look pretty busy. You can explicitly choose to enable/disable modules with a list of modules/regexes in your [[https://github.com/karlicoss/HPI/blob/f559e7cb899107538e6c6bbcf7576780604697ef/my/core/core_config.py#L24-L55][core config]], see [[https://github.com/purarue/dotfiles/blob/a1a77c581de31bd55a6af3d11b8af588614a207e/.config/my/my/config/__init__.py#L42-L72][here]] for an example.

You may use the other modules or [[https://github.com/karlicoss/hpi-personal-overlay][my overlay]] as a reference, but python packaging is already complicated enough before adding complexities like namespace packages and editable installs on top of it... If you're having trouble extending HPI in this fashion, you can open an issue here, preferably with a link to your code/repository and/or the ~setup.py~ you're trying to use.
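If it's ever unclear which directory a given module is being resolved from, a quick sanity check (plain python, nothing HPI-specific) is to look at the namespace package's ~__path__~ and the resolved module's ~__file__~:

#+begin_src python
import my
# every directory contributing to the 'my' namespace package, in resolution order
print(list(my.__path__))

import my.lastfm
# the file that actually won, i.e. your override or the upstream one
print(my.lastfm.__file__)
#+end_src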
* An Extendable module structure

In this context, 'overlay'/'override' means you create your own namespace package/file structure as described above, and since your files appear in front of the upstream repository files in the computed ~sys.path~ (either by using namespace modules, ~PYTHONPATH~ or ~with_my~), your file overrides the upstream one

Related issues: [[https://github.com/karlicoss/HPI/issues/102][#102]], [[https://github.com/karlicoss/HPI/issues/89][#89]], [[https://github.com/karlicoss/HPI/issues/154][#154]]

The main goals are:

- low effort: ideally it should be a matter of a few lines of code to override something.
- good interop: e.g. the ability to keep up with the upstream, use modules coming from separate repositories, etc.
- ideally mypy friendly. This kind of means 'not too dynamic and magical', which is ultimately a good thing even if you don't care about mypy.

~all.py~ using modules/sources behind ~import_source~ is the solution we've arrived at in HPI, because it meets all of these goals:

- it doesn't require an additional plugin system, it's just python imports and namespace packages
- it's generally mypy friendly (the only exception is the ~import_source~ decorator, but that typically returns nothing if the import failed)
- it doesn't require you to maintain a fork of this repository, though you can maintain a separate HPI repository (so no patching/merge conflicts)
- it allows you to easily add/remove sources in the ~all.py~ module, either by:
  - overriding the ~all.py~ in your own repository
  - just commenting out the source/adding 2 lines to import and ~yield from~ your new source
  - doing nothing! (~import_source~ will catch the error, warn you, and continue to work without changing any code)

It could be argued that namespace packages and editable installs are a bit complex for a new user to get the hang of, and this is true. But fortunately ~import_source~ means any user just using HPI only needs to follow the instructions when a warning is printed, or peruse the docs here a bit -- there's no need to clone or create your own override just to use the ~all.py~ file.

There's no requirement to use this for individual modules, it just seems to be the best solution we've arrived at so far

* Logging guidelines

HPI doesn't enforce any specific logging mechanism, you're free to use whatever you prefer in your modules. However, there are some general guidelines for developing modules that can make them more pleasant to use.

- each module should have its own unique logger; the easiest way to ensure that is to simply use the module's ~__name__~ attribute as the logger name

  In addition, this ensures the logger hierarchy reflects the package hierarchy. For instance, if you initialize the logger for =my.module= with specific settings, the logger for =my.module.helper= would inherit these settings. See more on that [[https://docs.python.org/3/library/logging.html?highlight=logging#logger-objects][in the python docs]].

  As a bonus, if you use the module ~__name__~, this logger will automatically be picked up and used by ~cachew~.

- often modules are processing multiple files, extracting data from each one ([[https://beepb00p.xyz/exports.html#types][incremental/synthetic exports]])

  It's nice to log each file name you're processing as =logger.info= so the user of the module gets a sense of progress. If possible, add the index of the file you're processing and the total count, e.g.:
  #+begin_src python
  def process_all_data():
      paths = inputs()
      total = len(paths)
      width = len(str(total))
      for idx, path in enumerate(paths):
          # :>{width} to align the logs vertically
          logger.info(f'processing [{idx:>{width}}/{total:>{width}}] {path}')
          yield from process_path(path)
  #+end_src

  If there is a lot of logging happening related to a specific path, instead of adding the path to each logging message manually, consider using a [[https://docs.python.org/3/library/logging.html?highlight=loggeradapter#logging.LoggerAdapter][LoggerAdapter]].

- log exceptions, but sparingly

  Generally it's good practice to call ~logging.exception~ from the ~except~ clause, so it's immediately visible where the errors are happening.

  However, in HPI, instead of crashing on exceptions we often behave defensively and ~yield~ them instead (see [[https://beepb00p.xyz/mypy-error-handling.html][mypy assisted error handling]]). In that case logging every exception may become a bit spammy, so use exception logging sparingly. Typically it's best to rely on the downstream data consumer to handle the exceptions properly.

- instead of =logging.getLogger=, it's best to use =my.core.make_logger=

  #+begin_src python
  from my.core import make_logger

  logger = make_logger(__name__)

  # or to set a custom level
  logger = make_logger(__name__, level='warning')
  #+end_src

  This sets up some nicer defaults over the standard =logging= module:

  - colored logs (via the =colorlog= library)
  - =INFO= as the initial logging level (instead of the default =ERROR=)
  - logging the full exception trace even when logging outside of the exception handler

    This is particularly useful for [[https://beepb00p.xyz/mypy-error-handling.html][mypy assisted error handling]]. By default, =logging= only logs the exception message (without the trace) in this case, which makes errors harder to debug.

- control the logging level from the shell via the ~LOGGING_LEVEL_*~ env variable

  This can be useful to suppress logging output if it's too spammy, or to show more output for debugging.

  E.g. ~LOGGING_LEVEL_my_instagram_gdpr=DEBUG hpi query my.instagram.gdpr.messages~

- experimental: passing the env variable ~LOGGING_COLLAPSE=~ will "collapse" logging with the same level

  Instead of printing a new logging line each time, it will 'redraw' the last logged line with the new logging message. This can be convenient if there are too many logs and you just need logging to get a sense of progress.

- experimental: passing the env variable ~ENLIGHTEN_ENABLE=yes~ will display TUI progress bars in some cases

  See [[https://github.com/Rockhopper-Technologies/enlighten#readme][https://github.com/Rockhopper-Technologies/enlighten#readme]]

  This can be convenient for showing the progress of parallel processing of different files from HPI:

  #+BEGIN_EXAMPLE
  ghexport.dal[111]      29%|████████████████████                   |  29/100 [00:03<00:07, 10.03 files/s]
  rexport.dal[comments]  17%|████████                               | 115/682 [00:03<00:14, 39.15 files/s]
  my.instagram.android    0%|▎                                      |   3/2631 [00:02<34:50,  1.26 files/s]
  #+END_EXAMPLE
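Returning to the ~LoggerAdapter~ suggestion above, here's a minimal sketch of the idea; the ~PathAdapter~ class and the surrounding function are made up for illustration:

#+begin_src python
import logging

from my.core import make_logger

logger = make_logger(__name__)


class PathAdapter(logging.LoggerAdapter):
    def process(self, msg, kwargs):
        # prefix every message with the path so it doesn't have to be repeated by hand
        return f"{self.extra['path']}: {msg}", kwargs


def process_path(path):
    plog = PathAdapter(logger, {'path': path})
    plog.info('starting parse')  # logged as "<path>: starting parse"
    # ... do the actual parsing, logging via plog along the way ...
    plog.info('done')
#+end_src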