HPI/doc/MODULES.org at 66a00c6ada841088fd53934d756cc879cca573ec

fz0x1/HPI


Sean Breckenridge 66a00c6ada docs: add docs for google_takeout_parser

2022-04-25 02:52:34 +01:00


TOC
Intro
Configs

This file is an overview of documented modules (which I'm progressively expanding).

There are many more, see:

- "What's inside" for the full list of modules
- you can also run hpi modules to list what's available on your system
- source code is always the primary source of truth

If you run into issues with the setup, see "Troubleshooting".

Intro

See SETUP to find out how to set up your own config.

Some explanations:

- MY_CONFIG is the path where you keep your private configuration (usually ~/.config/my/)
- Path is a standard Python object to represent paths
- PathIsh is a helper type to allow using either str, or a Path
- Paths is another helper type for paths.

  It's 'smart', allows you to be flexible about your config:
  - simple str or a Path
  - /a/path/to/directory/, so the module will consume all files from this directory
  - a list of files/directories (it will be flattened)
  - a glob string, so you can be flexible about the format of your data on disk (e.g. if you want to keep it compressed)
  - empty string (e.g. export_path = ''), this will prevent the module from consuming any data. This can be useful for modules that merge multiple data sources (for example, my.twitter or my.github)

  Typically, such a variable will be passed to get_files to actually extract the list of real files to use. You can see usage examples here.
- if the field has a default value, you can omit it from your private config altogether
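
To make the accepted Paths forms concrete, here is a minimal sketch of how such a value could be expanded into a list of files. Note that resolve_paths is a hypothetical stand-in written for illustration, not HPI's actual get_files; the non-recursive directory handling and the glob detection are assumptions.

```python
import glob
from pathlib import Path
from typing import List, Sequence, Union

PathIsh = Union[str, Path]
Paths = Union[PathIsh, Sequence[PathIsh]]


def resolve_paths(value: Paths) -> List[Path]:
    """Hypothetical helper: expand a Paths-style config value into files."""
    # empty string: the module should not consume any data
    if value == '':
        return []
    # a single str/Path is treated as a one-element list
    items = [value] if isinstance(value, (str, Path)) else list(value)
    found: List[Path] = []
    for item in items:
        text = str(item)
        if any(ch in text for ch in '*?['):
            # glob string, e.g. '/exports/*.json.xz'
            found.extend(Path(p) for p in glob.glob(text, recursive=True))
        elif Path(text).is_dir():
            # directory: take every file directly inside it (assumed non-recursive)
            found.extend(p for p in Path(text).iterdir() if p.is_file())
        else:
            # plain file path, kept as-is
            found.append(Path(text))
    return sorted(found)


print(resolve_paths(''))  # prints []
```

A list such as ['/some/dir/', '/other/*.json'] is flattened into a single sorted list of files, which is the behaviour the bullet points above describe.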

For more thoughts on modules and their structure, see MODULE_DESIGN

Configs

The config snippets below are meant to be modified accordingly and pasted into your private configuration, e.g. $MY_CONFIG/my/config.py.

You don't have to set up all modules at once; it's recommended to do it gradually, to get a feel for how HPI works.

For an extensive/complex example, you can check out @seanbreckenridge's config
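
As a smaller starting point, a private config covering just a few modules might look roughly like this. The class and field names come from the snippets in this file; every path is a placeholder and should be replaced with your own.

```python
# Hypothetical excerpt of $MY_CONFIG/my/config.py.
# All paths below are placeholders, not real locations.

class hypothesis:
    # glob string: picks up every JSON export in the directory
    export_path = '/data/exports/hypothesis/*.json'


class kobo:
    # a directory: the module will consume all files from it
    export_path = '/data/exports/kobo/'


class twitter_archive:
    # empty string: this module will not consume any data
    export_path = ''
```

Modules whose fields all have defaults (for example my.polar) can be left out entirely.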

my.reddit

Reddit data: saved items/comments/upvotes/etc.

class reddit:
    class rexport:
        '''
        Uses [[https://github.com/karlicoss/rexport][rexport]] output.
        '''

        # path[s]/glob to the exported JSON data
        export_path: Paths

    class pushshift:
        '''
        Uses [[https://github.com/seanbreckenridge/pushshift_comment_export][pushshift]] to get access to old comments
        '''

        # path[s]/glob to the exported JSON data
        export_path: Paths

my.browser

Parses browser history using browserexport

@dataclass
class browser:
    class export:
        # path[s]/glob to your backed up browser history sqlite files
        export_path: Paths

    class active_browser:
        # paths to sqlite database files which you use actively
        # to read from. For example:
        # from browserexport.browsers.all import Firefox
        # export_path = Firefox.locate_database()
        export_path: Paths

# TODO ugh, pkgutil.walk_packages doesn't recurse and find packages like my.twitter.archive??
# yep.. https://stackoverflow.com/q/41203765/706389
import importlib
import inspect
import re
from pathlib import Path
# from lint import all_modules # meh
# TODO figure out how to discover configs automatically...
modules = [
    ('google'         , 'my.google.takeout.parser'),
    ('hypothesis'     , 'my.hypothesis'           ),
    ('pocket'         , 'my.pocket'               ),
    ('twint'          , 'my.twitter.twint'        ),
    ('twitter_archive', 'my.twitter.archive'      ),
    ('lastfm'         , 'my.lastfm'               ),
    ('polar'          , 'my.polar'                ),
    ('instapaper'     , 'my.instapaper'           ),
    ('github'         , 'my.github.gdpr'          ),
    ('github'         , 'my.github.ghexport'      ),
    ('kobo'           , 'my.kobo'                 ),
]

def indent(s, spaces=4):
    return ''.join(' ' * spaces + line for line in s.splitlines(keepends=True))

print('\n') # ugh. hack for org-ruby drawers bug
for cls, p in modules:
    m = importlib.import_module(p)
    C = getattr(m, cls)
    src = inspect.getsource(C)
    i = src.find('@property')
    if i != -1:
        src = src[:i]
    src = src.strip()
    src = re.sub(r'(class \w+)\(.*', r'\1:', src)
    mpath = p.replace('.', '/')
    for x in ['.py', '/__init__.py']:
        if Path(mpath + x).exists():
            mpath = mpath + x
    print(f'** [[file:../{mpath}][{p}]]')
    mdoc = m.__doc__
    if mdoc is not None:
        print(indent(mdoc))
    print('    #+begin_src python')
    print(indent(src))
    print('    #+end_src')
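
The two string transformations in the loop above (cutting the source at the first @property and dropping the base class list) can be tried in isolation. This is a small sketch: trim_config_source and the demo class are hypothetical names introduced here, while the find/strip/re.sub steps are copied from the script.

```python
import re


def trim_config_source(src: str) -> str:
    """Apply the same cleanup the generator script applies to a config class."""
    # drop everything from the first @property onwards
    i = src.find('@property')
    if i != -1:
        src = src[:i]
    src = src.strip()
    # 'class name(base):' becomes 'class name:'
    return re.sub(r'(class \w+)\(.*', r'\1:', src)


# hypothetical config class source, for illustration only
example = '''class demo(user_config):
    # path[s]/glob to the exported JSON data
    export_path: Paths

    @property
    def helper(self):
        return None
'''

print(trim_config_source(example))
```

The printed snippet keeps only the plain fields, which is the part a user is expected to paste into their own config.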

my.google.takeout.parser

Parses Google Takeout using google_takeout_parser

See google_takeout_parser for more information about how to export and organize your takeouts

If the DISABLE_TAKEOUT_CACHE environment variable is set, this won't cache individual exports in ~/.cache/google_takeout_parser

takeout_path can point to unpacked takeout directories or to zip files of the exports; zip files are temporarily unpacked while creating the cachew cache

class google:
    # directory which includes unpacked/zipped takeouts
    takeout_path: Paths

    error_policy: ErrorPolicy = 'yield'

    # experimental flag to use core.kompress.ZipPath
    # instead of unpacking to a tmp dir via match_structure
    _use_zippath: bool = False

my.hypothesis

Hypothes.is highlights and annotations

class hypothesis:
    '''
    Uses [[https://github.com/karlicoss/hypexport][hypexport]] outputs
    '''

    # path[s]/glob to the exported JSON data
    export_path: Paths

my.pocket

Pocket bookmarks and highlights

class pocket:
    '''
    Uses [[https://github.com/karlicoss/pockexport][pockexport]] outputs
    '''

    # path[s]/glob to the exported JSON data
    export_path: Paths

my.twitter.twint

Twitter data (tweets and favorites).

Uses Twint data export.

Requirements: pip3 install --user dataset

class twint:
    export_path: Paths # path[s]/glob to the twint SQLite database

my.twitter.archive

Twitter data (uses official twitter archive export)

class twitter_archive:
    export_path: Paths # path[s]/glob to the twitter archive takeout

my.lastfm

Last.fm scrobbles

class lastfm:
    """
    Uses [[https://github.com/karlicoss/lastfm-backup][lastfm-backup]] outputs
    """
    export_path: Paths

my.polar

Polar articles and highlights

class polar:
    '''
    Polar config is optional, you only need it if you want to specify custom 'polar_dir'
    '''
    polar_dir: PathIsh = Path('~/.polar').expanduser()
    defensive: bool = True # pass False if you want it to fail faster on errors (useful for debugging)

my.instapaper

Instapaper bookmarks, highlights and annotations

class instapaper:
    '''
    Uses [[https://github.com/karlicoss/instapexport][instapexport]] outputs.
    '''
    # path[s]/glob to the exported JSON data
    export_path: Paths

my.github.gdpr

GitHub data (uses official GDPR export)

class github:
    gdpr_dir: PathIsh  # path to unpacked GDPR archive

my.github.ghexport

GitHub data: events, comments, etc. (API data)

class github:
    '''
    Uses [[https://github.com/karlicoss/ghexport][ghexport]] outputs.
    '''
    # path[s]/glob to the exported JSON data
    export_path: Paths

    # path to a cache directory
    # if omitted, will use /tmp
    cache_dir: Optional[PathIsh] = None

my.kobo

Kobo e-ink reader: annotations and reading stats

class kobo:
    '''
    Uses [[https://github.com/karlicoss/kobuddy#as-a-backup-tool][kobuddy]] outputs.
    '''
    # path[s]/glob to the exported databases
    export_path: Paths
