docs: wip on better explanation of configs/diagram

Dima Gerasimov 2020-05-26 21:11:10 +01:00 committed by karlicoss
parent 04eca6face
commit 150a6a8cb7


@@ -33,9 +33,9 @@ You'd be really helping me, I want to make the setup as straightforward as possi
* Few notes
I understand that people who'd like to use this may not be super familiar with Python, PIP or generally unix, so here are some useful notes:
- only ~python >= 3.6~ is supported
- I'm using the ~pip3~ command, but on your system you might only have ~pip~.
  If your ~pip --version~ says python 3, feel free to use ~pip~.
@@ -239,6 +239,8 @@ If you only have few modules set up, lots of them will error for you, which is e
If you have any ideas on how to improve it, please let me know!
Here's a screenshot of how it looks when everything is mostly good: [[https://user-images.githubusercontent.com/291333/82806066-f7dfe400-9e7c-11ea-8763-b3bee8ada308.png][link]].
* Usage examples
If you run your script with the ~with_my~ wrapper, you'd have ~my~ in ~PYTHONPATH~, which gives you access to your data from within the script.
@@ -349,3 +351,108 @@ This could be useful to monkey patch some behaviours, or dynamically add some ex
I'll put up a better guide on this, in the meantime see [[https://packaging.python.org/guides/packaging-namespace-packages]["namespace packages"]] for more info.
# TODO add example with overriding 'all'
* TODO diagram data flow/ 'how it works?'
Here are a few examples of how data flows into and through HPI for specific modules.
TODO link to some polar repository
Also, check out [[https://beepb00p.xyz/myinfra.html#mypkg][my infrastructure map]].
** Polar Bookshelf
Polar keeps the data on your disk, in =~/.polar=, in a bunch of JSON files.
It's excellent from all perspectives except one -- you can only use it through the Polar interface.
Which is, by all means, an awesome app, but you might want to integrate your data elsewhere.
TODO
https://github.com/TheCedarPrince/KnowledgeRepository
You can see it's messy: scattered across multiple directories, contains raw HTML, obscure entities, etc.
It's completely understandable from the app developer's perspective, but it makes things frustrating when you want to work with the data.
Here comes the HPI =my.polar= module!
: | ~/.polar (raw, messy data) | -------- HPI (my.polar) -------> | python interface >
Note that it doesn't require any extra configuration -- it just works because the data is kept locally in the *known location*.
# TODO org-mode examples?
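For example, reading your highlights could look something like this (a minimal sketch; ~get_entries~ and the attribute names are assumptions, check =my.polar= for the actual interface):
#+begin_src python
# sketch only: the entry point and attributes are assumptions
from my.polar import get_entries

for book in get_entries():
    print(book.title)
    for highlight in book.highlights:
        print('  ', highlight.text)
#+end_src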
** Google Takeout
# TODO twitter archive might be better here?
Google Takeout exports are manual (or semi-manual if you do some voodoo with mounting Google Drive).
Anyway, say you're doing it once every six months, so you end up with a bunch of archives:
: /backups/takeout/takeout-20151201.zip
: ....
: /backups/takeout/takeout-20190901.zip
: /backups/takeout/takeout-20200301.zip
Inside the archives... there is a [[https://www.specytech.com/blog/wp-content/uploads/2019/06/google-takeout-folder.png][bunch]] of random files from all your Google services.
Lately, many of them are JSON, but in 2015, for example, most of it was HTML! It's a nightmare to work with, even if you're an experienced programmer.
# Even within a single data source (e.g. =My Activity/Search=) you have a mix of HTML and JSON files.
# todo eh, I need to actually add json processing first
Of course, HPI also helps you here by encapsulating all this parsing logic and exposing Python interfaces.
The only thing you have to do is to tell it where to find the files via the config (because different people use different paths for their backups)!
# TODO how to emphasize config?
# TODO python is just one of the interfaces?
: < Google | ------>----{ manual download } ------->---- | /backups/takeout/*.zip | -------- HPI (my.google.takeout) -----> | python interface >
Again, the only thing you're required to do is to tell HPI how to find your Google Takeout backups via a config.py setting (TODO link).
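For instance, the relevant bit of the config might look something like this (a sketch; the attribute name is an assumption, check the module docs):
#+begin_src python
# sketch of the google section of config.py
# (~takeout_path~ is an assumption -- the actual attribute name may differ)
class google:
    # glob matching your takeout archives
    takeout_path = '/backups/takeout/*.zip'
#+end_src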
** Reddit
Reddit has a proper API, so in theory HPI could talk directly to Reddit and retrieve your data.
But that's not what it's doing!
First, there are excellent programmatic APIs for Reddit out there anyway, e.g. [[https://github.com/praw-dev/praw][praw]].
But second, this is a design decision of HPI -- it only accesses your filesystem, and doesn't deal with the complexities of API interactions.
# TODO link to post
Instead, it relies on other tools to put the intermediate, raw data on your disk, and then transforms this data into something nice.
As an example, for Reddit, HPI uses the [[https://github.com/karlicoss/rexport][rexport]] library to fetch the data from Reddit to your disk. So the pipeline looks like:
: < Reddit | ----->----- { rexport/export.py } ----->---- | /backups/reddit/*.json | ------- HPI (my.reddit) ---> | python interface >
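And on the Python side you get a plain interface to the parsed data, something like this (a sketch; ~saved~ and the attribute names are assumptions, check =my.reddit= for the actual interface):
#+begin_src python
# sketch only: the entry point and attributes are assumptions
from my.reddit import saved

for save in saved():
    print(save.created, save.title)
#+end_src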
So, in your config, similarly to Takeout, you're gonna need =export_path= so HPI can find your Reddit data.
But there is an extra caveat: rexport also keeps the data bindings (the code that parses the raw data) close to the exporter (TODO link to post?).
So we also need to tell HPI how to find rexport, via a TODO setting.
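So the Reddit bit of your config might look roughly like this (a sketch; only =export_path= is described above, the rexport location setting is still a TODO):
#+begin_src python
# sketch of the reddit section of config.py
class reddit:
    # glob matching the JSON files produced by rexport
    export_path = '/backups/reddit/*.json'
    # plus, per the caveat above, a (TODO) setting pointing HPI at rexport itself
#+end_src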
# todo running in cron
** Twitter
Twitter is interesting, because it's an example of a data source that *arbitrates* between several other sources.
The reason is: there is the Twitter Archive, but it's manual, takes several days to complete and TODO
There is also [[https://github.com/twintproject/twint][twint]], which can get realtime Twitter data via scraping. But Twitter has a limitation, and you can't get data past the most recent 3200 tweets.
So the idea is to export both data sources:
: / | ----->----- { manual archive download } ------>---- | /backups/twitter-archives/*.zip | ...
: | Twitter | | | ...
: \ | ----->----- { twint (automatic export) } ------>-----| /backups/twint.sqlite | ...
# TODO note that the left and right parts of the diagram ('before filesystem' and 'after filesystem') are completely independent!
# if something breaks, you can still read your old data from the filesystem!
1. Process data from twitter archives (manual export, but has all the data)
2. Process data from twint database (automatic export, but only recent data)
3. Merge them together, overlaying the twint data on top of the Twitter archive data
: ... | /backups/twitter-archives/*.zip | -- HPI (my.twitter.archive) ---\-------------------------------| python interface >
: ... | | >--- HPI (my.twitter.all) --- | python interface >
: ... | /backups/twint.sqlite | -- HPI (my.twitter.twint) ---/------------------------------ | python interface >
The auxiliary module =my.twitter.all= (TODO link) arbitrates between the data sources and gives you a unified view (it's really simple, check it out!).
Note that you can always just use =my.twitter.archive= or =my.twitter.twint= directly.
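To illustrate the idea, the merging could be as simple as deduplicating by tweet id (a sketch, not the actual =my.twitter.all= source; the ~tweets~ functions and the ~id~ attribute are assumptions):
#+begin_src python
# sketch of what a my.twitter.all-style module might do
from itertools import chain

from my.twitter import archive, twint

def tweets():
    # overlay twint data on top of the archive data:
    # iterate twint first, so for duplicate ids the twint version wins
    emitted = set()
    for tweet in chain(twint.tweets(), archive.tweets()):
        if tweet.id in emitted:
            continue
        emitted.add(tweet.id)
        yield tweet
#+end_src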
# (TODO mypy-safe?)
Now, say you prefer to use a different library for your Twitter data instead of twint (for whatever reason)