blob: 23afd85142a7792e6a5faefab7e88a440891eb71 [file] [log] [blame]
Ian Maxon858061a2022-05-12 19:11:28 -07001<!DOCTYPE html>
2<!--
3 | Generated by Apache Maven Doxia Site Renderer 1.8.1 from target/generated-site/markdown/udf.md at 2022-05-12
4 | Rendered using Apache Maven Fluido Skin 1.7
5-->
6<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
7 <head>
8 <meta charset="UTF-8" />
9 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
10 <meta name="Date-Revision-yyyymmdd" content="20220512" />
11 <meta http-equiv="Content-Language" content="en" />
12 <title>AsterixDB &#x2013; User-defined Functions</title>
13 <link rel="stylesheet" href="./css/apache-maven-fluido-1.7.min.css" />
14 <link rel="stylesheet" href="./css/site.css" />
15 <link rel="stylesheet" href="./css/print.css" media="print" />
16 <script type="text/javascript" src="./js/apache-maven-fluido-1.7.min.js"></script>
17
18 </head>
19 <body class="topBarDisabled">
20 <div class="container-fluid">
21 <div id="banner">
22 <div class="pull-left"><a href="./" id="bannerLeft"><img src="images/asterixlogo.png" alt="AsterixDB"/></a></div>
23 <div class="pull-right"></div>
24 <div class="clear"><hr/></div>
25 </div>
26
27 <div id="breadcrumbs">
28 <ul class="breadcrumb">
29 <li id="publishDate">Last Published: 2022-05-12</li>
30 <li id="projectVersion" class="pull-right">Version: 0.9.8</li>
31 <li class="pull-right"><a href="index.html" title="Documentation Home">Documentation Home</a></li>
32 </ul>
33 </div>
34 <div class="row-fluid">
35 <div id="leftColumn" class="span2">
36 <div class="well sidebar-nav">
37 <ul class="nav nav-list">
38 <li class="nav-header">Get Started - Installation</li>
39 <li><a href="ncservice.html" title="Option 1: using NCService"><span class="none"></span>Option 1: using NCService</a></li>
40 <li><a href="ansible.html" title="Option 2: using Ansible"><span class="none"></span>Option 2: using Ansible</a></li>
41 <li><a href="aws.html" title="Option 3: using Amazon Web Services"><span class="none"></span>Option 3: using Amazon Web Services</a></li>
42 <li class="nav-header">AsterixDB Primer</li>
43 <li><a href="sqlpp/primer-sqlpp.html" title="Using SQL++"><span class="none"></span>Using SQL++</a></li>
44 <li class="nav-header">Data Model</li>
45 <li><a href="datamodel.html" title="The Asterix Data Model"><span class="none"></span>The Asterix Data Model</a></li>
46 <li class="nav-header">Queries</li>
47 <li><a href="sqlpp/manual.html" title="The SQL++ Query Language"><span class="none"></span>The SQL++ Query Language</a></li>
48 <li><a href="SQLPP.html" title="Raw SQL++ Grammar"><span class="none"></span>Raw SQL++ Grammar</a></li>
49 <li><a href="sqlpp/builtins.html" title="Builtin Functions"><span class="none"></span>Builtin Functions</a></li>
50 <li class="nav-header">API/SDK</li>
51 <li><a href="api.html" title="HTTP API"><span class="none"></span>HTTP API</a></li>
52 <li><a href="csv.html" title="CSV Output"><span class="none"></span>CSV Output</a></li>
53 <li class="nav-header">Advanced Features</li>
54 <li><a href="aql/externaldata.html" title="Accessing External Data"><span class="none"></span>Accessing External Data</a></li>
55 <li><a href="feeds.html" title="Data Ingestion with Feeds"><span class="none"></span>Data Ingestion with Feeds</a></li>
56 <li class="active"><a href="#"><span class="none"></span>User Defined Functions</a></li>
57 <li><a href="sqlpp/filters.html" title="Filter-Based LSM Index Acceleration"><span class="none"></span>Filter-Based LSM Index Acceleration</a></li>
58 <li><a href="sqlpp/fulltext.html" title="Support of Full-text Queries"><span class="none"></span>Support of Full-text Queries</a></li>
59 <li><a href="sqlpp/similarity.html" title="Support of Similarity Queries"><span class="none"></span>Support of Similarity Queries</a></li>
60 <li><a href="geo/quickstart.html" title="GIS Support Overview"><span class="none"></span>GIS Support Overview</a></li>
61 <li><a href="geo/functions.html" title="GIS Functions"><span class="none"></span>GIS Functions</a></li>
62 <li><a href="interval_join.html" title="Support of Interval Joins"><span class="none"></span>Support of Interval Joins</a></li>
63 <li><a href="spatial_join.html" title="Support of Spatial Joins"><span class="none"></span>Support of Spatial Joins</a></li>
64 <li><a href="sqlpp/arrayindex.html" title="Support of Array Indexes"><span class="none"></span>Support of Array Indexes</a></li>
65 <li class="nav-header">Deprecated</li>
66 <li><a href="aql/primer.html" title="AsterixDB Primer: Using AQL"><span class="none"></span>AsterixDB Primer: Using AQL</a></li>
67 <li><a href="aql/manual.html" title="Queries: The Asterix Query Language (AQL)"><span class="none"></span>Queries: The Asterix Query Language (AQL)</a></li>
68 <li><a href="aql/builtins.html" title="Queries: Builtin Functions (AQL)"><span class="none"></span>Queries: Builtin Functions (AQL)</a></li>
69</ul>
70 <hr />
71 <div id="poweredBy">
72 <div class="clear"></div>
73 <div class="clear"></div>
74 <div class="clear"></div>
75 <div class="clear"></div>
76<a href="./" title="AsterixDB" class="builtBy"><img class="builtBy" alt="AsterixDB" src="images/asterixlogo.png" /></a>
77 </div>
78 </div>
79 </div>
80 <div id="bodyColumn" class="span10" >
81<!--
82 ! Licensed to the Apache Software Foundation (ASF) under one
83 ! or more contributor license agreements. See the NOTICE file
84 ! distributed with this work for additional information
85 ! regarding copyright ownership. The ASF licenses this file
86 ! to you under the Apache License, Version 2.0 (the
87 ! "License"); you may not use this file except in compliance
88 ! with the License. You may obtain a copy of the License at
89 !
90 ! http://www.apache.org/licenses/LICENSE-2.0
91 !
92 ! Unless required by applicable law or agreed to in writing,
93 ! software distributed under the License is distributed on an
94 ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
95 ! KIND, either express or implied. See the License for the
96 ! specific language governing permissions and limitations
97 ! under the License.
98 !-->
99<h1>User-defined Functions</h1>
100<div class="section">
101<h2><a name="Table_of_Contents"></a><a name="atoc" id="#toc">Table of Contents</a></h2>
102<ul>
103
104<li><a href="#introduction">Introduction</a></li>
105<li><a href="#installingUDF">Installing an UDF Library</a></li>
106<li><a href="#UDFOnFeeds">Attaching an UDF on Data Feeds</a></li>
107<li><a href="#udfConfiguration">A quick look of the UDF configuration</a></li>
108<li><a href="#adapter">User defined Feed Adapters</a></li>
109<li><a href="#uninstall">Unstalling an UDF Library</a><!--
110! Licensed to the Apache Software Foundation (ASF) under one
111! or more contributor license agreements. See the NOTICE file
112! distributed with this work for additional information
113! regarding copyright ownership. The ASF licenses this file
114! to you under the Apache License, Version 2.0 (the
115! "License"); you may not use this file except in compliance
116! with the License. You may obtain a copy of the License at
117!
118! http://www.apache.org/licenses/LICENSE-2.0
119!
120! Unless required by applicable law or agreed to in writing,
121! software distributed under the License is distributed on an
122! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
123! KIND, either express or implied. See the License for the
124! specific language governing permissions and limitations
125! under the License.
126!--></li>
127</ul></div>
128<div class="section">
129<h2><a name="Introduction"></a><a name="introduction">Introduction</a></h2>
130<p>Apache AsterixDB supports three languages for writing user-defined functions (UDFs): SQL++, Java, and Python A user can encapsulate data processing logic into a UDF and invoke it later repeatedly. For SQL++ functions, a user can refer to <a href="sqlpp/manual.html#Functions">SQL++ Functions</a> for their usages. This document will focus on UDFs in languages other than SQL++</p></div>
131<div class="section">
132<h2><a name="Endpoints_and_Authentication"></a><a name="authentication">Endpoints and Authentication</a></h2>
133<p>The UDF API endpoint used to deploy functions is not enabled by default until authentication has been configured properly. Even if the endpoint is enabled, it is only accessible on the loopback interface on each NC to restrict access.</p>
134<p>To enable it, we need to set the path to the credential file and populate it with our username and password.</p>
135<p>The credential file is a simple <tt>/etc/passwd</tt> style text file with usernames and corresponding <tt>bcrypt</tt> hashed and salted passwords. You can populate this on your own if you would like, but the <tt>asterixhelper</tt> utility can write the entries as well. We can invoke <tt>asterixhelper</tt> like so:</p>
136
137<div>
138<div>
139<pre class="source">$ bin/asterixhelper -u admin -p admin -cp opt/local/conf add_credential
140</pre></div></div>
141
142<p>Then, in your <tt>cc.conf</tt>, in the <tt>[cc]</tt> section, add the correct <tt>credential.file</tt> path</p>
143
144<div>
145<div>
146<pre class="source">[nc]
147address = 127.0.0.1
148...
149...
150credential.file = conf/passwd
151</pre></div></div>
152
153<p>Now,restart the cluster if it was already started to allow the Cluster Controller to find the new credentials.</p></div>
154<div class="section">
155<h2><a name="Installing_a_Java_UDF_Library"></a><a name="installingUDF">Installing a Java UDF Library</a></h2>
156<p>To install a UDF package to the cluster, we need to send a Multipart Form-data HTTP request to the <tt>/admin/udf</tt> endpoint of the CC at the normal API port (<tt>19004</tt> by default). Any suitable tool will do, but for the example here I will use <tt>curl</tt> which is widely available.</p>
157<p>For example, to install a library with the following criteria:</p>
158<ul>
159
160<li><tt>udfs</tt> dataverse name</li>
161<li>with a new Library name of <tt>testlib</tt></li>
162<li>from <tt>lib.zip</tt> in the present working directory</li>
163<li>to the cluster at <tt>localhost</tt> with API port <tt>19004</tt> of the Asterix CC</li>
164<li>with credentials being a username and password of <tt>admin:admin</tt></li>
165</ul>
166<p>we would execute</p>
167
168<div>
169<div>
170<pre class="source">curl -v -u admin:admin -X POST -F 'data=@./lib.zip' -F 'type=java' localhost:19004/admin/udf/udfs/testlib
171</pre></div></div>
172
173<p>Any response other than <tt>200</tt> indicates an error in deployment.</p>
174<p>In the AsterixDB source release, we provide several sample UDFs that you can try out. You need to build the AsterixDB source to get the compiled UDF package. It can be found under the <tt>asterix-external-data</tt> sub-project under the path <tt>asterixdb/asterix-external-data/src/test/java/org/apache/asterix/external/library</tt>. After compilation, the UDFs will be packed in a zip file at <tt>asterixdb/asterix-external-data/target/asterix-external-data-$VERSION-testlib.zip</tt> which you can use to upload to the AsterixDB cluster.</p>
175<p>Assuming that these UDFs have been installed into the <tt>testlib</tt> library in<tt>udfs</tt> dataverse, here is an example that uses the sample UDF <tt>mysum</tt> to compute the sum of two input integers.</p>
176
177<div>
178<div>
179<pre class="source">USE udfs;
180
181CREATE FUNCTION mysum(a: int32, b: int32)
182RETURNS int32
183 AS &quot;org.apache.asterix.external.library.MySumFactory&quot; AT testlib;
184</pre></div></div>
185</div>
186<div class="section">
187<h2><a name="Creating_a_Python_UDF"></a><a name="PythonUDF" id="PythonUDF">Creating a Python UDF</a></h2>
188<p>Python UDFs need to be rolled into a <a class="externalLink" href="https://github.com/linkedin/shiv">shiv</a> package with all their dependencies. By default AsterixDB will use the Python interpreter located at <tt>/usr/bin/python3</tt>. This can be changed in the cluster config <tt>[common]</tt> section using the <tt>python.path</tt> configuration variable.</p>
189<p>First, let&#x2019;s devise a function that we would like to use in AsterixDB, <tt>sentiment_mod.py</tt></p>
190
191<div>
192<div>
193<pre class="source">import os
194from typing import Tuple
195class sent_model:
196
197 def __init__(self):
198 good_words = os.path.join(os.path.dirname(__file__), 'good.txt')
199 with open(good_words) as f:
200 self.whitelist = f.read().splitlines()
201
202 def sentiment(self, arg: Tuple[str])-&gt; str:
203 words = arg[0].split()
204 for word in words:
205 if word in self.whitelist:
206 return 'great'
207
208 return 'eh'
209</pre></div></div>
210
211<p>Furthermore, let&#x2019;s assume &#x2018;good.txt&#x2019; contains the following entries</p>
212
213<div>
214<div>
215<pre class="source">spam
216eggs
217ham
218</pre></div></div>
219
220<p>Now, in the module directory, execute <tt>shiv</tt> with all the dependencies of the module listed. We don&#x2019;t actually use scikit-learn here (our method is obviously better!), but it&#x2019;s just included as an example of a real dependency.</p>
221
222<div>
223<div>
224<pre class="source">shiv -o lib.pyz --site-packages . scikit-learn
225</pre></div></div>
226
227<p>Then, deploy it the same as the Java UDF was, with the library name <tt>pylib</tt> in <tt>udfs</tt> dataverse</p>
228
229<div>
230<div>
231<pre class="source">curl -v -u admin:admin -X POST -F 'data=@./lib.pyz' -F 'type=python' localhost:19002/admin/udf/udfs/pylib
232</pre></div></div>
233
234<p>With the library deployed, we can define a function within it for use. For example, to expose the Python function <tt>sentiment</tt> in the module <tt>sentiment_mod</tt> in the class <tt>sent_model</tt>, the <tt>CREATE FUNCTION</tt> would be as follows</p>
235
236<div>
237<div>
238<pre class="source">USE udfs;
239
240CREATE FUNCTION sentiment(a)
241RETURNS TweetType
242 AS &quot;sentiment_mod&quot;, &quot;sent_model.sentiment&quot; AT pylib;
243</pre></div></div>
244
245<p>By default, AsterixDB will treat all external functions as deterministic. It means the function must return the same result for the same input, irrespective of when or how many times the function is called on that input. This particular function behaves the same on each input, so it satisfies the deterministic property. This enables better optimization of queries including this function. If a function is not deterministic then it should be declared as such by using a <tt>WITH</tt> sub-clause:</p>
246
247<div>
248<div>
249<pre class="source">USE udfs;
250
251CREATE FUNCTION sentiment(text)
252 AS &quot;sentiment_mod&quot;, &quot;sent_model.sentiment&quot; AT pylib
253 WITH { &quot;deterministic&quot;: false }
254</pre></div></div>
255
256<p>With the function now defined, it can then be used as any other scalar SQL++ function would be. For example:</p>
257
258<div>
259<div>
260<pre class="source">USE udfs;
261
262INSERT INTO Tweets([
263 {&quot;id&quot;:1, &quot;msg&quot;:&quot;spam is great&quot;},
264 {&quot;id&quot;:2, &quot;msg&quot;:&quot;i will not eat green eggs and ham&quot;},
265 {&quot;id&quot;:3, &quot;msg&quot;:&quot;bacon is better&quot;}
266]);
267
268SELECT t.msg as msg, sentiment(t.msg) as sentiment
269FROM Tweets t;
270</pre></div></div>
271</div>
272<div class="section">
273<h2><a name="Python_Type_Mappings"></a><a name="pytpes">Python Type Mappings</a></h2>
274<p>Currently only a subset of AsterixDB types are supported in Python UDFs. The supported types are as follows:</p>
275<ul>
276
277<li>Integer types (int8,16,32,64)</li>
278<li>Floating point types (float, double)</li>
279<li>String</li>
280<li>Boolean</li>
281<li>Arrays, Sets (cast to lists)</li>
282<li>Objects (cast to dict)</li>
283</ul>
284<p>Unsupported types can be cast to these in SQL++ first in order to be passed to a Python UDF</p></div>
285<div class="section">
286<h2><a name="Execution_Model_For_UDFs"></a><a name="execution">Execution Model For UDFs</a></h2>
287<p>AsterixDB queries are deployed across the cluster as Hyracks jobs. A Hyracks job has a lifecycle that can be simplified for the purposes of UDFs to - A pre-run phase which allocates resources, <tt>open</tt> - The time during which the job has data flowing through it, <tt>nextFrame</tt> - Cleanup and shutdown in <tt>close</tt>.</p>
288<p>If a SQL++ function is defined as a member of a class in the library, the class will be instantiated during <tt>open</tt>. The class will exist in memory for the lifetime of the query. Therefore if your function needs to reference files or other data that would be costly to load per-call, making it a member variable that is initialized in the constructor of the object will greatly increase the performance of the SQL++ function.</p>
289<p>For each function invoked during a query, there will be an independent instance of the function per data partition. This means that the function must not assume there is any global state or that it can assume things about the layout of the data. The execution of the function will be parallel to the same degree as the level of data parallelism in the cluster.</p>
290<p>After initialization, the function bound in the SQL++ function definition is called once per tuple during the query execution (i.e. <tt>nextFrame</tt>). Unless the function specifies <tt>null-call</tt> in the <tt>WITH</tt> clause, <tt>NULL</tt> values will be skipped.</p>
291<p>At the close of the query, the function is torn down and not re-used in any way. All functions should assume that nothing will persist in-memory outside of the lifetime of a query, and any behavior contrary to this is undefined.</p></div>
292<div class="section">
293<h2><a name="Attaching_a_UDF_on_Data_Feeds"></a><a name="UDFOnFeeds" id="UDFOnFeeds">Attaching a UDF on Data Feeds</a></h2>
294<p>In <a href="feeds.html">Data Ingestion using feeds</a>, we introduced an efficient way for users to get data into AsterixDB. In some use cases, users may want to pre-process the incoming data before storing it into the dataset. To meet this need, AsterixDB allows the user to attach a UDF onto the ingestion pipeline. Following the example in <a href="feeds.html">Data Ingestion</a>, here we show an example of how to attach a UDF that extracts the user names mentioned from the incoming Tweet text, storing the processed Tweets into a dataset.</p>
295<p>We start by creating the datatype and dataset that will be used for the feed and UDF. One thing to keep in mind is that data flows from the feed to the UDF and then to the dataset. This means that the feed&#x2019;s datatype should be the same as the input type of the UDF, and the output datatype of the UDF should be the same as the dataset&#x2019;s datatype. Thus, users should make sure that their datatypes are consistent in the UDF configuration. Users can also take advantage of open datatypes in AsterixDB by creating a minimum description of the data for simplicity. Here we use open datatypes:</p>
296
297<div>
298<div>
299<pre class="source">USE udfs;
300
301CREATE TYPE TweetType IF NOT EXISTS AS OPEN {
302 id: int64
303};
304
305CREATE DATASET ProcessedTweets(TweetType) PRIMARY KEY id;
306</pre></div></div>
307
308<p>As the <tt>TweetType</tt> is an open datatype, processed Tweets can be stored into the dataset after they are annotated with an extra attribute. Given the datatype and dataset above, we can create a Twitter Feed with the same datatype. Please refer to section <a href="feeds.html">Data Ingestion</a> if you have any trouble in creating feeds.</p>
309
310<div>
311<div>
312<pre class="source">USE udfs;
313
314CREATE FEED TwitterFeed WITH {
315 &quot;adapter-name&quot;: &quot;push_twitter&quot;,
316 &quot;type-name&quot;: &quot;TweetType&quot;,
317 &quot;format&quot;: &quot;twitter-status&quot;,
318 &quot;consumer.key&quot;: &quot;************&quot;,
319 &quot;consumer.secret&quot;: &quot;************&quot;,
320 &quot;access.token&quot;: &quot;**********&quot;,
321 &quot;access.token.secret&quot;: &quot;*************&quot;
322};
323</pre></div></div>
324
325<p>Then we define the function we want to apply to the feed</p>
326
327<div>
328<div>
329<pre class="source">USE udfs;
330
331CREATE FUNCTION addMentionedUsers(t: TweetType)
332 AS &quot;org.apache.asterix.external.library.AddMentionedUsersFactory&quot; AT testlib
333 WITH { &quot;resources&quot;: { &quot;textFieldName&quot;: &quot;text&quot; } };
334</pre></div></div>
335
336<p>After creating the feed, we attach the UDF onto the feed pipeline and start the feed with following statements:</p>
337
338<div>
339<div>
340<pre class="source">USE udfs;
341
342CONNECT FEED TwitterFeed TO DATASET ProcessedTweets APPLY FUNCTION addMentionedUsers;
343
344START FEED TwitterFeed;
345</pre></div></div>
346
347<p>You can check the annotated Tweets by querying the <tt>ProcessedTweets</tt> dataset:</p>
348
349<div>
350<div>
351<pre class="source">SELECT * FROM ProcessedTweets LIMIT 10;
352</pre></div></div>
353</div>
354<div class="section">
355<h2><a name="Installing_a_user-defined_Feed_Adapter"></a><a name="adapter">Installing a user-defined Feed Adapter</a></h2>
356<p>First, upload a zip file packaged the same way as a Java UDF, but also containing the adapter you would like to use. Next, issue a <tt>CREATE ADAPTER</tt> statement referencing the class name. For example:</p>
357
358<div>
359<div>
360<pre class="source">CREATE ADAPTER TweetAdapter
361 AS &quot;org.apache.asterix.external.library.adapter.TestTypedAdapterFactory&quot; AT testlib;
362</pre></div></div>
363
364<p>Then, the adapter can be used like any other adapter in a feed.</p>
365
366<div>
367<div>
368<pre class="source">CREATE FEED TweetFeed WITH {
369 &quot;adapter-name&quot;: &quot;TweetAdapter&quot;,
370 &quot;type-name&quot; : &quot;TweetType&quot;,
371 &quot;num_output_records&quot;: 4
372};
373</pre></div></div>
374</div>
375<div class="section">
376<h2><a name="Unstalling_an_UDF_Library"></a><a name="uninstall">Unstalling an UDF Library</a></h2>
377<p>If you want to uninstall the UDF library, simply issue a <tt>DELETE</tt> against the endpoint you <tt>POST</tt>ed against once all functions declared with the library are removed. First we&#x2019;ll drop the function we declared earlier:</p>
378
379<div>
380<div>
381<pre class="source">USE udfs;
382DROP FUNCTION mysum(a,b);
383</pre></div></div>
384
385<p>Then issue the proper <tt>DELETE</tt> request</p>
386
387<div>
388<div>
389<pre class="source">curl -u admin:admin -X DELETE localhost:19002/admin/udf/udfs/testlib
390</pre></div></div>
391
392<p>The library will also be dropped if you drop the dataverse entirely.</p></div>
393 </div>
394 </div>
395 </div>
396 <hr/>
397 <footer>
398 <div class="container-fluid">
399 <div class="row-fluid">
400<div class="row-fluid">Apache AsterixDB, AsterixDB, Apache, the Apache
401 feather logo, and the Apache AsterixDB project logo are either
402 registered trademarks or trademarks of The Apache Software
403 Foundation in the United States and other countries.
404 All other marks mentioned may be trademarks or registered
405 trademarks of their respective owners.
406 </div>
407 </div>
408 </div>
409 </footer>
410 </body>
411</html>