query: add --warn-exceptions, dateparser, docs
Added --warn-exceptions (like --raise-exceptions/--drop-exceptions), which lets you pass a warn_func if you want to customize how the exceptions are handled. By default this creates a logger in main and logs the exception.
Added dateparser as a fallback if it's installed (it's not a strong dependency, but I mentioned in the docs that it's useful for parsing dates/times).
Added docs for query, and a few examples.
--output gpx respects the --{drop,warn,raise}-exceptions flags; there is an example of that in the docs as well.
parent
82bc51d9fc
commit
a58fef0d06
7 changed files with 404 additions and 24 deletions
301
doc/QUERY.md
Normal file
|
@ -0,0 +1,301 @@
|
|||
`hpi query` is a command line tool for querying the output of any `hpi` function.
|
||||
|
||||
```
|
||||
Usage: hpi query [OPTIONS] FUNCTION_NAME...
|
||||
|
||||
This allows you to query the results from one or more functions in HPI
|
||||
|
||||
By default this runs with '-o json', converting the results to JSON and
|
||||
printing them to STDOUT
|
||||
|
||||
You can specify '-o pprint' to just print the objects using their repr, or
|
||||
'-o repl' to drop into a ipython shell with access to the results
|
||||
|
||||
While filtering using --order-key datetime, the --after, --before and
|
||||
--within flags parse the input to their datetime and timedelta equivalents.
|
||||
datetimes can be epoch time, the string 'now', or an date formatted in the
|
||||
ISO format. timedelta (durations) are parsed from a similar format to the
|
||||
GNU 'sleep' command, e.g. 1w2d8h5m20s -> 1 week, 2 days, 8 hours, 5 minutes,
|
||||
20 seconds
|
||||
|
||||
As an example, to query reddit comments I've made in the last month
|
||||
|
||||
hpi query --order-type datetime --before now --within 4w my.reddit.all.comments
|
||||
or...
|
||||
hpi query --recent 4w my.reddit.all.comments
|
||||
|
||||
Can also query within a range. To filter comments between 2016 and 2018:
|
||||
hpi query --order-type datetime --after '2016-01-01' --before '2019-01-01' my.reddit.all.comments
|
||||
|
||||
Options:
|
||||
-o, --output [json|pprint|repl|gpx]
|
||||
what to do with the result [default: json]
|
||||
-s, --stream stream objects from the data source instead
|
||||
of printing a list at the end
|
||||
-k, --order-key TEXT order by an object attribute or dict key on
|
||||
the individual objects returned by the HPI
|
||||
function
|
||||
-t, --order-type [datetime|date|int|float]
|
||||
order by searching for some type on the
|
||||
iterable
|
||||
-a, --after TEXT while ordering, filter items for the key or
|
||||
type larger than or equal to this
|
||||
-b, --before TEXT while ordering, filter items for the key or
|
||||
type smaller than this
|
||||
-w, --within TEXT a range 'after' or 'before' to filter items
|
||||
by. see above for further explanation
|
||||
-r, --recent TEXT a shorthand for '--order-type datetime
|
||||
--reverse --before now --within'. e.g.
|
||||
--recent 5d
|
||||
--reverse / --no-reverse reverse the results returned from the
|
||||
functions
|
||||
-l, --limit INTEGER limit the number of items returned from the
|
||||
(functions)
|
||||
--drop-unsorted if the order of an item can't be determined
|
||||
while ordering, drop those items from the
|
||||
results
|
||||
--wrap-unsorted if the order of an item can't be determined
|
||||
while ordering, wrap them into an
|
||||
'Unsortable' object
|
||||
--warn-exceptions if any errors are returned, print them as
|
||||
errors on STDERR
|
||||
--raise-exceptions if any errors are returned (as objects, not
|
||||
raised) from the functions, raise them
|
||||
--drop-exceptions ignore any errors returned as objects from
|
||||
the functions
|
||||
```
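
The duration format described in the help text (`1w2d8h5m20s`) can be sketched in Python as follows. This is only an illustration of the format, not the parser `hpi query` uses; the helper name `parse_duration` and the restriction to the unit letters `w`, `d`, `h`, `m`, `s` are assumptions taken from the example above.

```python
import re
from datetime import timedelta

# Unit letters taken from the help text example (1w2d8h5m20s).
# Hypothetical mapping: the real parser may accept more units.
_UNITS = {"w": "weeks", "d": "days", "h": "hours", "m": "minutes", "s": "seconds"}
_TOKEN = re.compile(r"(\d+)([wdhms])")


def parse_duration(text: str) -> timedelta:
    """Parse a string like '1w2d8h5m20s' into a timedelta."""
    parts = {}
    pos = 0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            # there is unparseable text between two tokens
            raise ValueError(f"could not parse duration: {text!r}")
        amount, unit = match.groups()
        key = _UNITS[unit]
        parts[key] = parts.get(key, 0) + int(amount)
        pos = match.end()
    if pos != len(text) or not parts:
        # trailing garbage, or nothing matched at all
        raise ValueError(f"could not parse duration: {text!r}")
    return timedelta(**parts)
```

With this sketch, `parse_duration("4w")` gives the same `timedelta` as `timedelta(weeks=4)`, which is the kind of value `--within 4w` stands for.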
|
||||
|
||||
This works with any function which returns an iterable, for example `my.coding.commits`, which searches for `git commit`s on your computer:
|
||||
|
||||
```bash
|
||||
hpi query my.coding.commits
|
||||
```
|
||||
|
||||
When run with a module, this does some analysis of the functions in that module and tries to find ones that look like data sources. If it can't figure out which one to use, it prompts you:
|
||||
|
||||
```
|
||||
Which function should be used from 'my.coding.commits'?
|
||||
|
||||
1. commits
|
||||
2. repos
|
||||
```
|
||||
|
||||
Select the one you want by pressing `1` or `2` on your keyboard. Alternatively, you can provide a fully qualified path, like:
|
||||
|
||||
```
|
||||
hpi query my.coding.commits.repos
|
||||
```
|
||||
|
||||
The corresponding `repos` function this queries is defined in [`my/coding/commits.py`](../my/coding/commits.py)
|
||||
|
||||
### Ordering/Filtering/Streaming
|
||||
|
||||
By default, this just returns the items in the order they were returned by the function. You can order and filter the results by specifying an `--order-key` or an `--order-type`. `--order-type datetime` will try to automatically figure out which attribute to use. If it chooses the wrong one (since `Commit`s have both a `committed_dt` and an `authored_dt`), you can tell it which to use with `--order-key`. For example, to scan my computer and find the most recent commit I made:
|
||||
|
||||
```
|
||||
hpi query my.coding.commits.commits --order-key committed_dt --limit 1 --reverse --output pprint --stream
|
||||
Commit(committed_dt=datetime.datetime(2023, 4, 14, 23, 9, 1, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))),
|
||||
authored_dt=datetime.datetime(2023, 4, 14, 23, 4, 1, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200))),
|
||||
message='sources.smscalls: propogate errors if there are breaking '
|
||||
'schema changes',
|
||||
repo='/home/sean/Repos/promnesia-fork',
|
||||
sha='22a434fca9a28df9b0915ccf16368df129d2c9ce',
|
||||
ref='refs/heads/smscalls-handle-result')
|
||||
```
|
||||
|
||||
To instead limit the results to some range, you can use `--before` and `--within`. For example, to get all the commits I committed in the last day:
|
||||
|
||||
```
|
||||
hpi query my.coding.commits.commits --order-type datetime --before now --within 1d
|
||||
```
|
||||
|
||||
That prints a list of `Commit`s as JSON objects. You could also use `--output pprint` to pretty-print the objects, or `--output repl` to drop into a REPL.
|
||||
|
||||
To process the JSON, you can pipe it to [`jq`](https://github.com/stedolan/jq). I often use `jq length` to get the count of some output:
|
||||
|
||||
```
|
||||
hpi query my.coding.commits.commits --order-type datetime --before now --within 1d | jq length
|
||||
6
|
||||
```
|
||||
|
||||
Because that is such a common use case, the `--recent` flag is a shorthand for `--order-type datetime --reverse --before now --within`. The same as above:
|
||||
|
||||
```
|
||||
hpi query my.coding.commits.commits --recent 1d | jq length
|
||||
6
|
||||
```
|
||||
|
||||
To select a range of commits, you can use `--after` and `--before`, passing ISO or epoch timestamps. Those can be full `datetimes` (`2021-01-01T00:05:30`) or just dates (`2021-01-01`). For example, to get all the commits I made on January 1st, 2021:
|
||||
|
||||
```
|
||||
hpi query my.coding.commits.commits --order-type datetime --after 2021-01-01 --before 2021-01-02 | jq length
|
||||
1
|
||||
```
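
According to the help text, `--after` keeps items larger than or equal to the given value, while `--before` keeps items strictly smaller, so the range is half-open. A minimal Python sketch of that filtering, where the `Commit` type and the `in_range` helper are hypothetical stand-ins rather than HPI code:

```python
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple, Optional


class Commit(NamedTuple):
    # Hypothetical stand-in for the real Commit type
    sha: str
    committed_dt: datetime


def in_range(
    items: Iterable[Commit],
    after: Optional[datetime] = None,
    before: Optional[datetime] = None,
) -> Iterator[Commit]:
    """Yield items with after <= committed_dt < before."""
    for item in items:
        dt = item.committed_dt
        if after is not None and dt < after:
            continue  # too early: 'after' is inclusive
        if before is not None and dt >= before:
            continue  # too late: 'before' is exclusive
        yield item
```

This is why `--after 2021-01-01 --before 2021-01-02` selects exactly one day: a commit at midnight on January 1st is included, and one at midnight on January 2nd is not.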
|
||||
|
||||
If you have [`dateparser`](https://github.com/scrapinghub/dateparser#how-to-use) installed, this supports dozens more natural language formats:
|
||||
|
||||
```
|
||||
hpi query my.coding.commits.commits --order-type datetime --after 'last week' --before 'day before yesterday' | jq length
|
||||
28
|
||||
```
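
The idea of treating `dateparser` as an optional fallback can be sketched like this. The function name `parse_datetime` and the order in which the formats are tried are assumptions made for illustration; only `dateparser.parse` is an actual library call, and it is used only if the package is importable.

```python
from datetime import datetime


def parse_datetime(text: str) -> datetime:
    """Parse 'now', an epoch timestamp or an ISO date, then fall back to dateparser."""
    if text == "now":
        return datetime.now()
    try:
        # epoch time, e.g. '1609459200'
        return datetime.fromtimestamp(float(text))
    except (ValueError, OverflowError, OSError):
        pass
    try:
        # ISO format, e.g. '2021-01-01' or '2021-01-01T00:05:30'
        return datetime.fromisoformat(text)
    except ValueError:
        pass
    try:
        import dateparser  # optional dependency
    except ImportError:
        raise ValueError(f"could not parse {text!r} (dateparser is not installed)") from None
    parsed = dateparser.parse(text)  # returns None if it cannot parse the input
    if parsed is None:
        raise ValueError(f"could not parse {text!r}")
    return parsed
```

Importing `dateparser` inside the function keeps it a soft dependency: the strict formats still work if it is missing, and natural language input such as `'last week'` only works when it is installed.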
|
||||
|
||||
If you're having issues ordering because there are exceptions in your results, or because not all data is sortable (items may have `None` for some attributes), you can use `--drop-unsorted` to drop those items from the results, or `--drop-exceptions` to remove the exceptions.
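
HPI functions can return errors as objects instead of raising them, which is what the `--drop-exceptions`, `--raise-exceptions` and `--warn-exceptions` flags act on. A rough Python sketch of the three modes, where `handle_exceptions` and its `mode` argument are hypothetical, and the assumption is that warned exceptions are also removed from the results:

```python
import sys
from typing import Any, Callable, Iterable, Iterator, Optional


def _print_warning(exc: Exception) -> None:
    # default handler for this sketch: print the error to STDERR
    print(f"warning: {exc!r}", file=sys.stderr)


def handle_exceptions(
    src: Iterable[Any],
    mode: str = "keep",
    warn_func: Optional[Callable[[Exception], None]] = None,
) -> Iterator[Any]:
    """Yield items from src, treating Exception objects according to mode."""
    for item in src:
        if not isinstance(item, Exception):
            yield item
        elif mode == "drop":
            continue  # silently ignore the error
        elif mode == "raise":
            raise item  # turn the returned error into a raised one
        elif mode == "warn":
            # report the error with the custom or default handler, then skip it
            (warn_func or _print_warning)(item)
        else:
            yield item  # "keep": pass errors through untouched
```

Passing a custom `warn_func` is what lets a caller route the errors to a logger instead of printing them.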
|
||||
|
||||
You can also stream the results, which is useful for functions that take a while to process or have a lot of data. For example, if you wanted to pick a sha hash from a particular repo, you could use `jq` to `select` the matching objects and pick that attribute from the JSON:
|
||||
|
||||
```
|
||||
hpi query my.coding.commits.commits --recent 30d --stream | jq 'select(.repo | contains("HPI"))' | jq '.sha' -r
|
||||
4afa899c8b365b3c10e468f6279c02e316d3b650
|
||||
40de162fab741df594b4d9651348ee46ee021e9b
|
||||
e1cb229913482074dc5523e57ef0acf6e9ec2bb2
|
||||
87c13defd131e39292b93dcea661d3191222dace
|
||||
02c738594f2cae36ca4fab43cf9533fe6aa89396
|
||||
0b3a2a6ef3a9e4992771aaea0252fb28217b814a
|
||||
84817ce72d208038b66f634d4ceb6e3a4c7ec5e9
|
||||
47992b8e046d27fc5141839179f06f925c159510
|
||||
425615614bd508e28ccceb56f43c692240e429ab
|
||||
eed8f949460d768fb1f1c4801e9abab58a5f9021
|
||||
d26ad7d9ce6a4718f96346b994c3c1cd0d74380c
|
||||
aec517e53c6ac022f2b4cc91261daab5651cebf0
|
||||
44b75a88fdfc7af132f61905232877031ce32fcb
|
||||
b0ff6f29dd2846e97f8aa85a2ca73736b03254a8
|
||||
```
|
||||
|
||||
`select` acts on a stream of JSON objects, not a list, so it filters as the objects are generated. The alternative would be to print the entire JSON list at the end, like:
|
||||
|
||||
`hpi query my.coding.commits.commits --recent 30d | jq '.[] | select(.repo | contains("Repos/HPI"))' | jq '.sha' -r`, using `jq '.[]'` to convert the JSON list into a stream of JSON objects.
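
If you would rather do this post-processing in Python than in `jq`, the difference between the two output shapes can be handled in one place. This sketch assumes that `--stream` prints JSON objects one after another and that the default output is a single JSON list; the helper names are hypothetical, and the `repo` and `sha` keys come from the examples above.

```python
import json
from typing import Any, Iterator, List


def iter_json_objects(text: str) -> Iterator[Any]:
    """Yield objects from either a stream of JSON values or a single JSON list."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        value, pos = decoder.raw_decode(text, pos)
        if isinstance(value, list):
            yield from value  # non-streamed output: flatten the list
        else:
            yield value  # streamed output: one object at a time


def shas_for_repo(text: str, needle: str) -> List[str]:
    """Return the sha of every commit whose repo path contains needle."""
    return [c["sha"] for c in iter_json_objects(text) if needle in c["repo"]]
```

Here `text` would be the captured STDOUT of `hpi query`, for example read from `sys.stdin` when the script is used at the end of a pipe.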
|
||||
|
||||
## Usage on non-HPI code
|
||||
|
||||
The command can accept any qualified function name, so this could for example be used to check the output of [`promnesia`](https://github.com/karlicoss/promnesia) commands:
|
||||
|
||||
```
|
||||
hpi query promnesia.sources.smscalls | jq length
|
||||
371
|
||||
```
|
||||
|
||||
## GPX
|
||||
|
||||
The `hpi query` command can also be used with the `--output gpx` flag to generate GPX files from a list of locations, like the ones defined in the `my.location` package. This could be used to extract some date range and create a `gpx` file which can then be visualized by a GUI application.
|
||||
|
||||
This prints the contents of the `GPX` file to STDOUT, and prints warnings to STDERR for any objects it could not convert to locations, so redirect STDOUT to an output file, like `>out.gpx`.
|
||||
|
||||
```
|
||||
hpi query my.location.all --after '2021-07-01T00:00:00' --before '2021-07-05T00:00:00' --order-type datetime --output gpx >out.gpx
|
||||
```
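
To give a feel for what such a conversion involves, here is a small Python sketch that builds a GPX 1.1 track with the standard library. It is not the code `hpi query` uses; the `Location` type, its field names and the `to_gpx` helper are assumptions made for the example.

```python
import sys
import xml.etree.ElementTree as ET
from datetime import datetime
from typing import Any, Iterable, NamedTuple


class Location(NamedTuple):
    # Hypothetical location type; the real ones live in the my.location package
    lat: float
    lon: float
    dt: datetime


def to_gpx(items: Iterable[Any]) -> str:
    """Build a GPX document with one track from the Location items."""
    gpx = ET.Element(
        "gpx",
        version="1.1",
        creator="example",
        xmlns="http://www.topografix.com/GPX/1/1",
    )
    segment = ET.SubElement(ET.SubElement(gpx, "trk"), "trkseg")
    for item in items:
        if not isinstance(item, Location):
            # mirror the described behaviour: warn on STDERR and keep going
            print(f"warning: could not convert {item!r}", file=sys.stderr)
            continue
        point = ET.SubElement(segment, "trkpt", lat=str(item.lat), lon=str(item.lon))
        ET.SubElement(point, "time").text = item.dt.isoformat()
    return ET.tostring(gpx, encoding="unicode")
```

Because the warnings go to STDERR, printing the returned string and redirecting STDOUT to a file still produces a valid GPX file.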
|
||||
|
||||
If you want to ignore any errors, you can use `--drop-exceptions`.
|
||||
|
||||
To preview, you can use something like [`qgis`](https://qgis.org/en/site/) or, for something easier and more lightweight, [`gpxsee`](https://github.com/tumic0/GPXSee):
|
||||
|
||||
`gpxsee out.gpx`:
|
||||
|
||||
TODO: add image here
|
||||
|
||||
(Sidenote: these are [`@seanbreckenridge`](https://github.com/seanbreckenridge/)'s locations, on a trip to Chicago)
|
||||
|
||||
## Python reference
|
||||
|
||||
The `hpi query` command is a CLI wrapper around the code in [`query.py`](../my/core/query.py) and [`query_range.py`](../my/core/query_range.py). The `select` function is the core of this, and `select_range` is a specialized version of it that lets you specify dates, timedeltas and start-end ranges, along with other CLI-specific behavior.
|
||||
|
||||
`select`:
|
||||
|
||||
```
|
||||
A function to query, order, sort and filter items from one or more sources
|
||||
This supports iterables and lists of mixed types (including handling errors),
|
||||
by allowing you to provide custom predicates (functions) which can sort
|
||||
by a function, an attribute, dict key, or by the attributes values.
|
||||
|
||||
Since this supports mixed types, there's always a possibility
|
||||
of KeyErrors or AttributeErrors while trying to find some value to order by,
|
||||
so this provides multiple mechanisms to deal with that
|
||||
|
||||
'where' lets you filter items before ordering, to remove possible errors
|
||||
or filter the iterator by some condition
|
||||
|
||||
There are multiple ways to instruct select on how to order items. The most
|
||||
flexible is to provide an 'order_by' function, which takes an item in the
|
||||
iterator, does any custom checks you may want and then returns the value to sort by
|
||||
|
||||
'order_key' is best used on items which have a similar structure, or have
|
||||
the same attribute name for every item in the iterator. If you have a
|
||||
iterator of objects whose datetime is accessed by the 'timestamp' attribute,
|
||||
supplying order_key='timestamp' would sort by that (dictionary or attribute) key
|
||||
|
||||
'order_value' is the most confusing, but often the most useful. Instead of
|
||||
testing against the keys of an item, this allows you to write a predicate
|
||||
(function) to test against its values (dictionary, NamedTuple, dataclass, object).
|
||||
If you had an iterator of mixed types and wanted to sort by the datetime,
|
||||
but the attribute to access the datetime is different on each type, you can
|
||||
provide `order_value=lambda v: isinstance(v, datetime)`, and this will
|
||||
try to find that value for each type in the iterator, to sort it by
|
||||
the value which is received when the predicate is true
|
||||
|
||||
'order_value' is often used in the 'hpi query' interface, because of its brevity.
|
||||
Just given the input function, this can typically sort it by timestamp with
|
||||
no human intervention. It can sort of be thought as an educated guess,
|
||||
but it can always be improved by providing a more complete guess function
|
||||
|
||||
Note that 'order_value' is also the most computationally expensive, as it has
|
||||
to copy the iterator in memory (using itertools.tee) to determine how to order it
|
||||
in memory
|
||||
|
||||
The 'drop_exceptions', 'raise_exceptions', 'warn_exceptions' let you ignore or raise
|
||||
when the src contains exceptions. The 'warn_func' lets you provide a custom function
|
||||
to call when an exception is encountered instead of using the 'warnings' module
|
||||
|
||||
src: an iterable of mixed types, or a function to be called,
|
||||
as the input to this function
|
||||
|
||||
where: a predicate which filters the results before sorting
|
||||
|
||||
order_by: a function which when given an item in the src,
|
||||
returns the value to sort by. Similar to the 'key' value
|
||||
typically passed directly to 'sorted'
|
||||
|
||||
order_key: a string which represents a dict key or attribute name
|
||||
to use as the key to sort by
|
||||
|
||||
order_value: predicate which determines which attribute on an ADT-like item to sort by,
|
||||
when given its value. lambda o: isinstance(o, datetime) is commonly passed to sort
|
||||
by datetime, without knowing the attributes or interface for the items in the src
|
||||
|
||||
default: while ordering, if the order for an object cannot be determined,
|
||||
use this as the default value
|
||||
|
||||
reverse: reverse the order of the resulting iterable
|
||||
|
||||
limit: limit the results to this many items
|
||||
|
||||
drop_unsorted: before ordering, drop any items from the iterable for which a
|
||||
order could not be determined. False by default
|
||||
|
||||
wrap_unsorted: before ordering, wrap any items into an 'Unsortable' object. Place
|
||||
them at the front of the list. True by default
|
||||
|
||||
drop_exceptions: ignore any exceptions from the src
|
||||
|
||||
raise_exceptions: raise exceptions when received from the input src
|
||||
```
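
The `order_value` idea from the docstring above is easier to see with a concrete example. The sketch below is a simplified, self-contained illustration and not the `select` implementation: `find_value` and `sort_by_value` are hypothetical helpers, and items with no matching value are simply dropped, similar in spirit to `drop_unsorted`.

```python
import dataclasses
from datetime import datetime
from typing import Any, Callable, Iterable, List, NamedTuple


@dataclasses.dataclass
class Message:  # hypothetical item type
    text: str
    sent_at: datetime


class Visit(NamedTuple):  # hypothetical item type
    url: str
    when: datetime


def find_value(item: Any, predicate: Callable[[Any], bool]) -> Any:
    """Return the first value on item for which predicate is true, else None."""
    if isinstance(item, dict):
        values = list(item.values())
    elif dataclasses.is_dataclass(item):
        values = [getattr(item, f.name) for f in dataclasses.fields(item)]
    elif isinstance(item, tuple):
        values = list(item)  # covers NamedTuple instances
    else:
        values = list(vars(item).values()) if hasattr(item, "__dict__") else []
    for value in values:
        if predicate(value):
            return value
    return None


def sort_by_value(items: Iterable[Any], predicate: Callable[[Any], bool]) -> List[Any]:
    """Sort mixed items by whichever of their values satisfies predicate."""
    keyed = [(find_value(item, predicate), item) for item in items]
    sortable = [pair for pair in keyed if pair[0] is not None]
    return [item for _, item in sorted(sortable, key=lambda pair: pair[0])]
```

Calling `sort_by_value(items, lambda v: isinstance(v, datetime))` orders `Message`s, `Visit`s and plain dicts together by their datetime, even though each type stores it under a different name.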
|
||||
|
||||
`select_range`:
|
||||
|
||||
```
|
||||
A specialized select function which offers generating functions
|
||||
to filter/query ranges from an iterable
|
||||
|
||||
order_key and order_value are used in the same way they are in select
|
||||
|
||||
If you specify order_by_value_type, it tries to search for an attribute
|
||||
on each object/type which has that type, ordering the iterable by that value
|
||||
|
||||
unparsed_range is a tuple of length 3, specifying 'after', 'before', 'duration',
|
||||
i.e. some start point to allow the computed value we're ordering by, some
|
||||
end point and a duration (can use the RangeTuple NamedTuple to construct one)
|
||||
|
||||
(this is typically parsed/created in my.core.__main__, from CLI flags)
|
||||
|
||||
If you specify a range, drop_unsorted is forced to be True
|
||||
```
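
As a rough illustration of how an 'after', 'before', 'duration' triple can be turned into concrete bounds, consider the sketch below. It is an assumption-laden simplification, not the logic in `query_range.py`: the `Range` tuple and `resolve` function are hypothetical, and rejecting a duration that has no single anchor is a choice made for this example.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional, Tuple


class Range(NamedTuple):
    # Hypothetical stand-in for the (after, before, duration) triple
    after: Optional[datetime]
    before: Optional[datetime]
    within: Optional[timedelta]


def resolve(r: Range) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Turn an (after, before, within) triple into (start, end) bounds."""
    after, before, within = r
    if within is None:
        return after, before  # plain start/end range, either side may be open
    if after is not None and before is None:
        return after, after + within  # duration counted forward from 'after'
    if before is not None and after is None:
        return before - within, before  # duration counted back from 'before'
    # Assumption for this sketch: a duration needs exactly one anchor point
    raise ValueError("'within' needs exactly one of 'after' or 'before'")
```

Under these assumptions, `--before now --within 1d` corresponds to `resolve(Range(None, datetime.now(), timedelta(days=1)))`, which yields the last 24 hours.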
|
||||
|
||||
Those can be imported and accept any sort of iterator; `hpi query` just defaults to the output of functions here. As an example, see [`listens`](https://github.com/seanbreckenridge/HPI-personal/blob/master/scripts/listens), which just passes a generator (iterator) as the first argument to `select_range`.
|