query: add --warn-exceptions, dateparser, docs

added --warn-exceptions (like --raise-exceptions/--drop-exceptions, but lets you pass a warn_func if you want to customize how the exceptions are handled). By default this creates a logger in main and logs the exception

added dateparser as a fallback if it's installed (it's not a strong dependency, but I mentioned in the docs that it's useful for parsing dates/times)

added docs for query, and a few examples

--output gpx respects the --{drop,warn,raise}-exceptions flags, have an example of that in the docs as well
2023-04-15 12:09:47 -07:00
commit a58fef0d06
parent 82bc51d9fc
7 changed files with 404 additions and 24 deletions
--- a/doc/QUERY.md
+++ b/doc/QUERY.md
@@ -0,0 +1,301 @@
+`hpi query` is a command line tool for querying the output of any `hpi` function.
+
+```
+Usage: hpi query [OPTIONS] FUNCTION_NAME...
+
+  This allows you to query the results from one or more functions in HPI
+
+  By default this runs with '-o json', converting the results to JSON and
+  printing them to STDOUT
+
+  You can specify '-o pprint' to just print the objects using their repr, or
+  '-o repl' to drop into an IPython shell with access to the results
+
+  While filtering using --order-key datetime, the --after, --before and
+  --within flags parse the input to their datetime and timedelta equivalents.
+  datetimes can be epoch time, the string 'now', or a date formatted in the
+  ISO format. timedelta (durations) are parsed from a similar format to the
+  GNU 'sleep' command, e.g. 1w2d8h5m20s -> 1 week, 2 days, 8 hours, 5 minutes,
+  20 seconds
+
+  As an example, to query reddit comments I've made in the last month
+
+  hpi query --order-type datetime --before now --within 4w my.reddit.all.comments
+  or...
+  hpi query --recent 4w my.reddit.all.comments
+
+  Can also query within a range. To filter comments between 2016 and 2018:
+  hpi query --order-type datetime --after '2016-01-01' --before '2019-01-01' my.reddit.all.comments
+
+Options:
+  -o, --output [json|pprint|repl|gpx]
+                                  what to do with the result [default: json]
+  -s, --stream                    stream objects from the data source instead
+                                  of printing a list at the end
+  -k, --order-key TEXT            order by an object attribute or dict key on
+                                  the individual objects returned by the HPI
+                                  function
+  -t, --order-type [datetime|date|int|float]
+                                  order by searching for some type on the
+                                  iterable
+  -a, --after TEXT                while ordering, filter items for the key or
+                                  type larger than or equal to this
+  -b, --before TEXT               while ordering, filter items for the key or
+                                  type smaller than this
+  -w, --within TEXT               a range 'after' or 'before' to filter items
+                                  by. see above for further explanation
+  -r, --recent TEXT               a shorthand for '--order-type datetime
+                                  --reverse --before now --within'. e.g.
+                                  --recent 5d
+  --reverse / --no-reverse        reverse the results returned from the
+                                  functions
+  -l, --limit INTEGER             limit the number of items returned from the
+                                  (functions)
+  --drop-unsorted                 if the order of an item can't be determined
+                                  while ordering, drop those items from the
+                                  results
+  --wrap-unsorted                 if the order of an item can't be determined
+                                  while ordering, wrap them into an
+                                  'Unsortable' object
+  --warn-exceptions               if any errors are returned, print them as
+                                  errors on STDERR
+  --raise-exceptions              if any errors are returned (as objects, not
+                                  raised) from the functions, raise them
+  --drop-exceptions               ignore any errors returned as objects from
+                                  the functions
+```
+
+This works with any function which returns an iterable, for example `my.coding.commits`, which searches for `git commit`s on your computer:
+
+```bash
+hpi query my.coding.commits
+```
+
+When run with a module, this does some analysis of the functions in that module and tries to find ones that look like data sources. If it can't figure out which, it prompts you like:
+
+```
+Which function should be used from 'my.coding.commits'?
+
+	1. commits
+	2. repos
+```
+
+You select the one you want by pressing `1` or `2` on your keyboard. Otherwise, you can provide a fully qualified path, like:
+
+```
+hpi query my.coding.commits.repos
+```
+
+The corresponding `repos` function this queries is defined in [`my/coding/commits.py`](../my/coding/commits.py)
+
+### Ordering/Filtering/Streaming
+
+By default, this just returns the items in the order they were returned by the function. You can order and filter them by specifying `--order-key` or `--order-type`. `--order-type datetime` tries to figure out automatically which attribute to use. If it chooses the wrong one (since `Commit`s have both a `committed_dt` and an `authored_dt`), you can tell it which to use with `--order-key`. For example, to scan my computer and find the most recent commit I made:
+
+```
+hpi query my.coding.commits.commits --order-key committed_dt --limit 1 --reverse --output pprint --stream
+Commit(committed_dt=datetime.datetime(2023, 4, 14, 23, 9, 1, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))),
+       authored_dt=datetime.datetime(2023, 4, 14, 23, 4, 1, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))),
+       message='sources.smscalls: propogate errors if there are breaking '
+               'schema changes',
+       repo='/home/sean/Repos/promnesia-fork',
+       sha='22a434fca9a28df9b0915ccf16368df129d2c9ce',
+       ref='refs/heads/smscalls-handle-result')
+```
+
+To limit the results to some range instead, you can use `--before` and `--within`. For example, to get all the commits I committed in the last day:
+
+```
+hpi query my.coding.commits.commits --order-type datetime --before now --within 1d
+```
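As a rough illustration of how a `--within` duration such as `1d` or `1w2d8h5m20s` can be turned into a range, here is a minimal sketch. The names `parse_duration` and `window_before` are hypothetical and are not taken from the HPI source; the real parsing code may differ.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Map each unit suffix to the timedelta keyword it stands for.
_UNITS = {"w": "weeks", "d": "days", "h": "hours", "m": "minutes", "s": "seconds"}
_TOKEN = re.compile(r"(\d+)([wdhms])")


def parse_duration(text: str) -> timedelta:
    """Parse a string such as '1w2d8h5m20s' into a timedelta (hypothetical helper)."""
    matches = list(_TOKEN.finditer(text))
    # Reject input that contains anything other than <number><unit> pairs.
    if not matches or "".join(m.group(0) for m in matches) != text:
        raise ValueError(f"could not parse duration: {text!r}")
    total = timedelta()
    for m in matches:
        total += timedelta(**{_UNITS[m.group(2)]: int(m.group(1))})
    return total


def window_before(before: datetime, within: str) -> Tuple[datetime, datetime]:
    """Return the (after, before) bounds implied by '--before X --within D'."""
    return before - parse_duration(within), before


# '--before now --within 1d' then covers the 24 hours leading up to now.
after, before = window_before(datetime.now(timezone.utc), "1d")
```

Items would then be kept when their datetime is greater than or equal to `after` and smaller than `before`, matching the flag descriptions in the help text.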
+
+That prints a list of `Commit`s as JSON objects. You could also use `--output pprint` to pretty-print the objects or `--output repl` to drop into a REPL.
+
+To process the JSON, you can pipe it to [`jq`](https://github.com/stedolan/jq). I often use `jq length` to get the count of some output:
+
+```
+hpi query my.coding.commits.commits --order-type datetime --before now --within 1d | jq length
+6
+```
+
+Because that is such a common use case, the `--recent` flag is a shorthand for `--order-type datetime --reverse --before now --within`. The same as above:
+
+```
+hpi query my.coding.commits.commits --recent 1d | jq length
+6
+```
+
+To select a range of commits, you can use `--after` and `--before`, passing ISO or epoch timestamps. Those can be full `datetimes` (`2021-01-01T00:05:30`) or just dates (`2021-01-01`). For example, to get all the commits I made on January 1st, 2021:
+
+```
+hpi query my.coding.commits.commits --order-type datetime --after 2021-01-01 --before 2021-01-02 | jq length
+1
+```
+
+If you have [`dateparser`](https://github.com/scrapinghub/dateparser#how-to-use) installed, this supports dozens more natural language formats:
+
+```
+hpi query my.coding.commits.commits --order-type datetime --after 'last week' --before 'day before yesterday' | jq length
+28
+```
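To make the parsing order concrete, here is a minimal sketch of how the accepted inputs could be tried in turn, with `dateparser` used only as an optional fallback. `parse_datetime` is a hypothetical name and the order of attempts is an assumption, not the HPI implementation; `dateparser.parse` is the library's documented entry point and returns `None` when it cannot parse the input.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_datetime(text: str, now: Optional[datetime] = None) -> datetime:
    """Parse 'now', an epoch timestamp or an ISO string; else try dateparser."""
    if text == "now":
        return now or datetime.now()
    try:
        # Epoch seconds, e.g. '1609459200' (interpreted as UTC in this sketch)
        return datetime.fromtimestamp(float(text), tz=timezone.utc)
    except ValueError:
        pass
    try:
        # ISO format: a date ('2021-01-01') or a datetime ('2021-01-01T00:05:30')
        return datetime.fromisoformat(text)
    except ValueError:
        pass
    try:
        import dateparser  # optional dependency, only imported when needed
    except ImportError:
        raise ValueError(f"could not parse {text!r}; installing dateparser may help")
    parsed = dateparser.parse(text)
    if parsed is None:
        raise ValueError(f"could not parse {text!r}")
    return parsed
```

Timezone handling is deliberately left out here, so mixing epoch input (timezone-aware) with ISO input (naive) in one comparison would need extra care.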
+
+If you're having issues ordering because there are exceptions in your results, or because not all data is sortable (items may have `None` for some attributes), you can use `--drop-unsorted` to drop those items from the results, or `--drop-exceptions` to remove the exceptions.
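
The three exception flags deal with errors that a function returns as objects rather than raising. A minimal sketch of that idea follows; `handle_exceptions` and its `mode` argument are hypothetical, and dropping an error after warning about it is an assumption about the behaviour rather than something confirmed by the source.

```python
import logging
from typing import Any, Callable, Iterable, Iterator, Optional

logger = logging.getLogger("query_sketch")


def handle_exceptions(
    items: Iterable[Any],
    mode: str = "warn",
    warn_func: Optional[Callable[[Exception], None]] = None,
) -> Iterator[Any]:
    """Yield non-error items; treat Exception instances according to mode."""
    if mode not in ("drop", "warn", "raise"):
        raise ValueError(f"unknown mode: {mode!r}")
    for item in items:
        if not isinstance(item, Exception):
            yield item
        elif mode == "raise":
            raise item
        elif mode == "warn":
            # Fall back to logging when no custom handler is supplied.
            (warn_func or logger.warning)(item)
        # mode == "drop": skip the error silently


results = [1, ValueError("bad row"), 2]
clean = list(handle_exceptions(results, mode="drop"))
```

Passing a `warn_func` lets the caller decide what happens to each error, for example collecting them in a list instead of logging them.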
+
+You can also stream the results, which is useful for functions that take a while to process or have a lot of data. For example, if you wanted to pick a sha hash from a particular repo, you could use `jq`'s `select` to filter the objects and then pick that attribute from the JSON:
+
+```
+hpi query my.coding.commits.commits --recent 30d --stream | jq 'select(.repo | contains("HPI"))' | jq '.sha' -r
+4afa899c8b365b3c10e468f6279c02e316d3b650
+40de162fab741df594b4d9651348ee46ee021e9b
+e1cb229913482074dc5523e57ef0acf6e9ec2bb2
+87c13defd131e39292b93dcea661d3191222dace
+02c738594f2cae36ca4fab43cf9533fe6aa89396
+0b3a2a6ef3a9e4992771aaea0252fb28217b814a
+84817ce72d208038b66f634d4ceb6e3a4c7ec5e9
+47992b8e046d27fc5141839179f06f925c159510
+425615614bd508e28ccceb56f43c692240e429ab
+eed8f949460d768fb1f1c4801e9abab58a5f9021
+d26ad7d9ce6a4718f96346b994c3c1cd0d74380c
+aec517e53c6ac022f2b4cc91261daab5651cebf0
+44b75a88fdfc7af132f61905232877031ce32fcb
+b0ff6f29dd2846e97f8aa85a2ca73736b03254a8
+```
+
+`select` acts on a stream of JSON objects, not a list, so it filters as the objects are generated. The alternative would be to print the entire JSON list at the end, like:
+
+`hpi query my.coding.commits.commits --recent 30d | jq '.[] | select(.repo | contains("Repos/HPI"))' | jq '.sha' -r`, using `jq '.[]'` to convert the JSON list into a stream of JSON objects.
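
The difference between the two output styles can be sketched in a few lines of Python. This is only an illustration of streaming versus collecting; `dump_stream` and `dump_list` are hypothetical names, not functions from HPI.

```python
import json
from typing import Any, Iterable, Iterator


def dump_stream(items: Iterable[Any]) -> Iterator[str]:
    """Yield one JSON document per item as soon as the item is available."""
    for item in items:
        yield json.dumps(item)


def dump_list(items: Iterable[Any]) -> str:
    """Consume every item first, then return a single JSON list."""
    return json.dumps(list(items))


def slow_source() -> Iterator[dict]:
    """Stand-in for a data source that produces items one at a time."""
    for i in range(3):
        yield {"sha": str(i)}


# Streaming: each line can be printed (and piped to jq) before the source finishes.
for line in dump_stream(slow_source()):
    print(line)
```

With the list form nothing can be printed until the source is exhausted, which is why streaming helps for slow or very large data sources.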
+
+## Usage on non-HPI code
+
+The command can accept any qualified function name, so this could for example be used to check the output of [`promnesia`](https://github.com/karlicoss/promnesia) commands:
+
+```
+hpi query promnesia.sources.smscalls | jq length
+371
+```
+
+## GPX
+
+The `hpi query` command can also be used with the `--output gpx` flag to generate GPX files from a list of locations, like the ones defined in the `my.location` package. This could be used to extract some date range and create a `gpx` file which can then be visualized by a GUI application.
+
+This prints the contents of the `GPX` file to STDOUT, and prints warnings for any objects it could not convert to locations to STDERR, so redirect STDOUT to an output file, like `>out.gpx`:
+
+```
+hpi query my.location.all --after '2021-07-01T00:00:00' --before '2021-07-05T00:00:00' --order-type datetime --output gpx >out.gpx
+```
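For a sense of what ends up in the file, here is a minimal sketch that writes a GPX 1.1 track using only the standard library. It assumes each location exposes `lat`, `lon` and `dt` attributes, and `to_gpx` is a hypothetical helper; the actual command may build its output with a different library.

```python
import sys
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from typing import Any, Iterable

GPX_NS = "http://www.topografix.com/GPX/1/1"


def to_gpx(items: Iterable[Any]) -> str:
    """Build a GPX 1.1 document with one track from objects having lat/lon/dt."""
    gpx = ET.Element("gpx", version="1.1", creator="sketch", xmlns=GPX_NS)
    seg = ET.SubElement(ET.SubElement(gpx, "trk"), "trkseg")
    for item in items:
        try:
            lat, lon, dt = item.lat, item.lon, item.dt
        except AttributeError:
            # Anything that is not location-like is reported on STDERR and skipped.
            print(f"warning: not a location: {item!r}", file=sys.stderr)
            continue
        point = ET.SubElement(seg, "trkpt", lat=str(lat), lon=str(lon))
        ET.SubElement(point, "time").text = dt.isoformat()
    return ET.tostring(gpx, encoding="unicode")
```

Because warnings go to STDERR, redirecting STDOUT to a file still leaves a valid GPX document.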
+
+If you want to ignore any errors, you can use `--drop-exceptions`.
+
+To preview, you can use something like [`qgis`](https://qgis.org/en/site/) or, for something easier and more lightweight, [`gpxsee`](https://github.com/tumic0/GPXSee):
+
+`gpxsee out.gpx`:
+
+TODO: add image here
+
+(Sidenote: these are [`@seanbreckenridge`](https://github.com/seanbreckenridge/)'s locations, on a trip to Chicago)
+
+## Python reference
+
+The `hpi query` command is a CLI wrapper around the code in [`query.py`](../my/core/query.py) and [`query_range.py`](../my/core/query_range.py). The `select` function is the core of this, and `select_range` lets you specify dates, timedelta, start-end ranges, and other CLI-specific code.
+
+`select`:
+
+```
+    A function to query, order, sort and filter items from one or more sources
+    This supports iterables and lists of mixed types (including handling errors),
+    by allowing you to provide custom predicates (functions) which can sort
+    by a function, an attribute, dict key, or by the attributes values.
+
+    Since this supports mixed types, there's always a possibility
+    of KeyErrors or AttributeErrors while trying to find some value to order by,
+    so this provides multiple mechanisms to deal with that
+
+    'where' lets you filter items before ordering, to remove possible errors
+    or filter the iterator by some condition
+
+    There are multiple ways to instruct select on how to order items. The most
+    flexible is to provide an 'order_by' function, which takes an item in the
+    iterator, does any custom checks you may want and then returns the value to sort by
+
+    'order_key' is best used on items which have a similar structure, or have
+    the same attribute name for every item in the iterator. If you have an
+    iterator of objects whose datetime is accessed by the 'timestamp' attribute,
+    supplying order_key='timestamp' would sort by that (dictionary or attribute) key
+
+    'order_value' is the most confusing, but often the most useful. Instead of
+    testing against the keys of an item, this allows you to write a predicate
+    (function) to test against its values (dictionary, NamedTuple, dataclass, object).
+    If you had an iterator of mixed types and wanted to sort by the datetime,
+    but the attribute to access the datetime is different on each type, you can
+    provide `order_value=lambda v: isinstance(v, datetime)`, and this will
+    try to find that value for each type in the iterator, to sort it by
+    the value which is received when the predicate is true
+
+    'order_value' is often used in the 'hpi query' interface, because of its brevity.
+    Just given the input function, this can typically sort it by timestamp with
+    no human intervention. It can sort of be thought as an educated guess,
+    but it can always be improved by providing a more complete guess function
+
+    Note that 'order_value' is also the most computationally expensive, as it has
+    to copy the iterator in memory (using itertools.tee) to determine how to order it
+    in memory
+
+    The 'drop_exceptions', 'raise_exceptions', 'warn_exceptions' let you ignore or raise
+    when the src contains exceptions. The 'warn_func' lets you provide a custom function
+    to call when an exception is encountered instead of using the 'warnings' module
+
+    src:            an iterable of mixed types, or a function to be called,
+                    as the input to this function
+
+    where:          a predicate which filters the results before sorting
+
+    order_by:       a function which when given an item in the src,
+                    returns the value to sort by. Similar to the 'key' value
+                    typically passed directly to 'sorted'
+
+    order_key:      a string which represents a dict key or attribute name
+                    to use as the key to sort by
+
+    order_value:    predicate which determines which attribute on an ADT-like item to sort by,
+                    when given its value. lambda o: isinstance(o, datetime) is commonly passed to sort
+                    by datetime, without knowing the attributes or interface for the items in the src
+
+    default:        while ordering, if the order for an object cannot be determined,
+                    use this as the default value
+
+    reverse:        reverse the order of the resulting iterable
+
+    limit:          limit the results to this many items
+
+    drop_unsorted:  before ordering, drop any items from the iterable for which a
+                    order could not be determined. False by default
+
+    wrap_unsorted:  before ordering, wrap any items into an 'Unsortable' object. Place
+                    them at the front of the list. True by default
+
+    drop_exceptions: ignore any exceptions from the src
+
+    raise_exceptions: raise exceptions when received from the input src
+```
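The `order_value` idea can be sketched without HPI installed: look through the values of each item, take the first one the predicate accepts, and sort by it. Everything below (`find_order_value`, `sort_by_value`, and the `Commit`/`Comment` types) is hypothetical and only illustrates the technique. Unlike `select`, which wraps unsortable items by default, this sketch simply drops them.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Callable, Iterable, List, NamedTuple, Optional


@dataclass
class Commit:  # hypothetical item type
    message: str
    committed_dt: datetime


class Comment(NamedTuple):  # hypothetical item type
    text: str
    created: datetime


def find_order_value(item: Any, predicate: Callable[[Any], bool]) -> Optional[Any]:
    """Return the first value on the item for which predicate is true, else None."""
    if isinstance(item, dict):
        values = list(item.values())
    elif hasattr(item, "_asdict"):  # NamedTuple
        values = list(item._asdict().values())
    else:  # dataclass or plain object
        values = list(getattr(item, "__dict__", {}).values())
    return next((v for v in values if predicate(v)), None)


def sort_by_value(items: Iterable[Any], predicate: Callable[[Any], bool]) -> List[Any]:
    """Sort items by the matching value; items without one are dropped."""
    keyed = [(find_order_value(item, predicate), item) for item in items]
    sortable = [pair for pair in keyed if pair[0] is not None]
    return [item for _, item in sorted(sortable, key=lambda pair: pair[0])]


mixed = [
    Commit("fix typo", datetime(2023, 4, 14)),
    Comment("nice post", datetime(2021, 1, 1)),
    {"when": datetime(2022, 6, 1)},
    {"no": "timestamp"},
]
ordered = sort_by_value(mixed, lambda v: isinstance(v, datetime))
```

The attribute name differs on every type here (`committed_dt`, `created`, `when`), yet a single predicate is enough to order them all.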
+
+`select_range`:
+
+```
+    A specialized select function which offers generating functions
+    to filter/query ranges from an iterable
+
+    order_key and order_value are used in the same way they are in select
+
+    If you specify order_by_value_type, it tries to search for an attribute
+    on each object/type which has that type, ordering the iterable by that value
+
+    unparsed_range is a tuple of length 3, specifying 'after', 'before', 'duration',
+    i.e. some start point to allow the computed value we're ordering by, some
+    end point and a duration (can use the RangeTuple NamedTuple to construct one)
+
+    (this is typically parsed/created in my.core.__main__, from CLI flags)
+
+    If you specify a range, drop_unsorted is forced to be True
+```
+
+Those can be imported and accept any sort of iterator; `hpi query` just defaults to the output of functions here. As an example, see [`listens`](https://github.com/seanbreckenridge/HPI-personal/blob/master/scripts/listens), which just passes a generator (iterator) as the first argument to `query_range`.