[ASTERIXDB-2894] Update UDF docs
- user model changes: no
- storage format changes: no
- interface changes: no
Details:
- Update API examples to include type
- Include details about typing and execution model
Change-Id: Id9780d72960f9094c29f7f5766185782069fe7cf
Reviewed-on: https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/11225
Reviewed-by: Ian Maxon <imaxon@uci.edu>
Reviewed-by: Dmitry Lychagin <dmitry.lychagin@couchbase.com>
Integration-Tests: Jenkins <jenkins@fulliautomatix.ics.uci.edu>
Tested-by: Jenkins <jenkins@fulliautomatix.ics.uci.edu>
diff --git a/asterixdb/asterix-doc/src/main/user-defined_function/udf.md b/asterixdb/asterix-doc/src/main/user-defined_function/udf.md
index 7ca23bb..655113b 100644
--- a/asterixdb/asterix-doc/src/main/user-defined_function/udf.md
+++ b/asterixdb/asterix-doc/src/main/user-defined_function/udf.md
@@ -19,7 +19,7 @@
## <a name="introduction">Introduction</a>
-Apache AsterixDB supports three languages for writing user-defined functions (UDFs): SQL++, Java and Python
+Apache AsterixDB supports three languages for writing user-defined functions (UDFs): SQL++, Java, and Python.
A user can encapsulate data processing logic into a UDF and invoke it
later repeatedly. For SQL++ functions, a user can refer to [SQL++ Functions](sqlpp/manual.html#Functions)
for their usages. This document will focus on UDFs in languages other than SQL++
@@ -27,8 +27,10 @@
## <a name="authentication">Endpoints and Authentication</a>
-The UDF endpoint is not enabled by default until authentication has been configured properly. To enable it, we
-will need to set the path to the credential file and populate it with our username and password.
+The UDF API endpoint used to deploy functions is disabled until authentication has been configured properly.
+Even if the endpoint is enabled, it is only accessible on the loopback interface on each NC to restrict access.
+
+To enable it, we need to set the path to the credential file and populate it with our username and password.
The credential file is a simple `/etc/passwd` style text file with usernames and corresponding `bcrypt` hashed and salted
passwords. You can populate this on your own if you would like, but the `asterixhelper` utility can write the entries as
@@ -50,9 +52,7 @@
## <a name="installingUDF">Installing a Java UDF Library</a>
To install a UDF package to the cluster, we need to send a Multipart Form-data HTTP request to the `/admin/udf` endpoint
-of the CC at the normal API port (`19002` by default). The request should use HTTP Basic authentication. This means your
-credentials will *not* be obfuscated or encrypted *in any way*, so submit to this endpoint over localhost or a network
-where you know your traffic is safe from eavesdropping. Any suitable tool will do, but for the example here I will use
+of the CC at the normal API port (`19004` by default). Any suitable tool will do, but for the example here I will use
`curl` which is widely available.
For example, to install a library with the following criteria:
@@ -65,7 +65,7 @@
we would execute
- curl -v -u admin:admin -X POST -F 'data=@./lib.zip' localhost:19002/admin/udf/udfs/testlib
+ curl -v -u admin:admin -X POST -F 'data=@./lib.zip' -F 'type=java' localhost:19004/admin/udf/udfs/testlib
Any response other than `200` indicates an error in deployment.
@@ -119,7 +119,7 @@
Then, deploy it the same as the Java UDF was, with the library name `pylib` in `udfs` dataverse
- curl -v -u admin:admin -X POST -F 'data=@./lib.pyz' localhost:19002/admin/udf/udfs/pylib
+    curl -v -u admin:admin -X POST -F 'data=@./lib.pyz' -F 'type=python' localhost:19004/admin/udf/udfs/pylib
With the library deployed, we can define a function within it for use. For example, to expose the Python function
`sentiment` in the module `sentiment_mod` in the class `sent_model`, the `CREATE FUNCTION` would be as follows
@@ -131,14 +131,14 @@
AS "sentiment_mod", "sent_model.sentiment" AT pylib;
By default, AsterixDB will treat all external functions as deterministic. It means the function must return the same
-result for the same input, irrespective of when or how many times the function is called on that input.
-This particular function behaves the same on each input, so it satisfies the deterministic property.
+result for the same input, irrespective of when or how many times the function is called on that input.
+This particular function behaves the same on each input, so it satisfies the deterministic property.
This enables better optimization of queries including this function.
-If a function is not deterministic then it should be declared as such by using `WITH` sub-clause:
+If a function is not deterministic then it should be declared as such by using a `WITH` sub-clause:
USE udfs;
- CREATE FUNCTION sentiment(a)
+ CREATE FUNCTION sentiment(text)
AS "sentiment_mod", "sent_model.sentiment" AT pylib
WITH { "deterministic": false }
@@ -155,6 +155,43 @@
SELECT t.msg as msg, sentiment(t.msg) as sentiment
FROM Tweets t;
+## <a name="pytpes">Python Type Mappings</a>
+
+Currently only a subset of AsterixDB types are supported in Python UDFs. The supported types are as follows:
+
+- Integer types (int8, int16, int32, int64)
+- Floating point types (float, double)
+- String
+- Boolean
+- Arrays, Sets (cast to lists)
+- Objects (cast to dict)
+
+Unsupported types can be cast to one of these in SQL++ first so that they can be passed to a Python UDF.
+
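As a rough illustration of the list above, the following sketch is a plain Python function that reports the Python type of each argument it receives. The function name `describe` and its parameters are hypothetical, and the exact way AsterixDB passes arguments to a UDF is not covered here. The sketch also assumes that integers, floating point values, strings, and booleans arrive as Python `int`, `float`, `str`, and `bool`; only the list and dict mappings are stated above.

```python
def describe(count, score, name, active, tags, profile):
    """Hypothetical UDF body: report the Python type seen for each argument."""
    return {
        "count": type(count).__name__,      # integer types (assumed to map to int)
        "score": type(score).__name__,      # float/double (assumed to map to float)
        "name": type(name).__name__,        # string (assumed to map to str)
        "active": type(active).__name__,    # boolean (assumed to map to bool)
        "tags": type(tags).__name__,        # arrays and sets are cast to lists
        "profile": type(profile).__name__,  # objects are cast to dicts
    }


# Local call with stand-in values; no AsterixDB instance is involved.
kinds = describe(3, 0.5, "a", True, ["x", "y"], {"k": 1})
print(kinds)
```

Running a function like this locally with representative values is a cheap way to check that a UDF only relies on the container types listed above before packaging it.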
+## <a name="execution">Execution Model For UDFs</a>
+
+AsterixDB queries are deployed across the cluster as Hyracks jobs. A Hyracks job has a lifecycle that can be simplified
+for the purposes of UDFs to three phases:
+ - A pre-run phase which allocates resources, `open`
+ - The time during which the job has data flowing through it, `nextFrame`
+ - Cleanup and shutdown, `close`
+
+If a SQL++ function is defined as a member of a class in the library, the class will be instantiated
+during `open`. The instance will exist in memory for the lifetime of the query. Therefore, if your function needs
+files or other data that would be costly to load on every call, load them once in the constructor and keep them in a
+member variable, so that the cost is paid once per instance rather than once per call.
+
+For each function invoked during a query, there will be an independent instance of the function per data partition. This
+means that the function must not rely on any global state or make assumptions about the layout
+of the data. The execution of the function will be parallel to the same degree as the level of data parallelism in the
+cluster.
+
+After initialization, the function bound in the SQL++ function definition is called once per tuple during the query
+execution (i.e. `nextFrame`). Unless the function specifies `null-call` in the `WITH` clause, `NULL` values will be
+skipped.
+
+At the close of the query, the function is torn down and not re-used in any way. All functions should assume that
+nothing will persist in memory beyond the lifetime of a query, and any behavior contrary to this is undefined.
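The lifecycle described in this section can be sketched in plain Python. The class and method names mirror the `sent_model.sentiment` example, but the body is illustrative only: the inline word set stands in for a file or model that would be costly to load, the method is assumed to receive a single string, and `run_partition` is a hypothetical stand-in for what happens on one data partition, not an AsterixDB API.

```python
class sent_model:
    """Hypothetical UDF class; the real library contents are not shown here."""

    def __init__(self):
        # Runs once per instance, corresponding to `open`.
        # A real library might read a word list or model file here instead.
        self.good_words = {"great", "good", "love"}
        self.calls = 0

    def sentiment(self, text):
        # Runs once per tuple, corresponding to `nextFrame`.
        # It reuses the state prepared in the constructor.
        self.calls += 1
        words = text.lower().split()
        if any(word in self.good_words for word in words):
            return "positive"
        return "neutral"


def run_partition(rows):
    """Stand-in for one partition: create the instance, call it per row, discard it."""
    model = sent_model()                            # open
    results = [model.sentiment(r) for r in rows]    # nextFrame
    calls = model.calls
    del model                                       # close: nothing is kept afterwards
    return results, calls


results, calls = run_partition(["I love this", "it rained today"])
print(results, calls)
```

Each call to `run_partition` builds its own instance, which matches the point that instances are independent per partition and that no state should be expected to survive the query.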
## <a id="UDFOnFeeds">Attaching a UDF on Data Feeds</a>
@@ -239,7 +276,7 @@
functions declared with the library are removed. First we'll drop the function we declared earlier:
USE udfs;
- DROP FUNCTION mysum@2;
+ DROP FUNCTION mysum(a,b);
Then issue the proper `DELETE` request