blob: 64ab07ba66ac2d94311fa479e4d2290e28221ebc [file] [log] [blame]
Ian Maxona1cc51b2020-08-07 13:11:35 -07001<!DOCTYPE html>
2<!--
3 | Generated by Apache Maven Doxia Site Renderer 1.8.1 from target/generated-site/markdown/udf.md at 2020-08-07
4 | Rendered using Apache Maven Fluido Skin 1.7
5-->
6<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
7 <head>
8 <meta charset="UTF-8" />
9 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
10 <meta name="Date-Revision-yyyymmdd" content="20200807" />
11 <meta http-equiv="Content-Language" content="en" />
12 <title>AsterixDB &#x2013; User-defined Functions</title>
13 <link rel="stylesheet" href="./css/apache-maven-fluido-1.7.min.css" />
14 <link rel="stylesheet" href="./css/site.css" />
15 <link rel="stylesheet" href="./css/print.css" media="print" />
16 <script type="text/javascript" src="./js/apache-maven-fluido-1.7.min.js"></script>
17
18 </head>
19 <body class="topBarDisabled">
20 <div class="container-fluid">
21 <div id="banner">
22 <div class="pull-left"><a href="./" id="bannerLeft"><img src="images/asterixlogo.png" alt="AsterixDB"/></a></div>
23 <div class="pull-right"></div>
24 <div class="clear"><hr/></div>
25 </div>
26
27 <div id="breadcrumbs">
28 <ul class="breadcrumb">
29 <li id="publishDate">Last Published: 2020-08-07</li>
30 <li id="projectVersion" class="pull-right">Version: 0.9.5</li>
31 <li class="pull-right"><a href="index.html" title="Documentation Home">Documentation Home</a></li>
32 </ul>
33 </div>
34 <div class="row-fluid">
35 <div id="leftColumn" class="span2">
36 <div class="well sidebar-nav">
37 <ul class="nav nav-list">
38 <li class="nav-header">Get Started - Installation</li>
39 <li><a href="ncservice.html" title="Option 1: using NCService"><span class="none"></span>Option 1: using NCService</a></li>
40 <li><a href="ansible.html" title="Option 2: using Ansible"><span class="none"></span>Option 2: using Ansible</a></li>
41 <li><a href="aws.html" title="Option 3: using Amazon Web Services"><span class="none"></span>Option 3: using Amazon Web Services</a></li>
42 <li class="nav-header">AsterixDB Primer</li>
43 <li><a href="sqlpp/primer-sqlpp.html" title="Using SQL++"><span class="none"></span>Using SQL++</a></li>
44 <li class="nav-header">Data Model</li>
45 <li><a href="datamodel.html" title="The Asterix Data Model"><span class="none"></span>The Asterix Data Model</a></li>
46 <li class="nav-header">Queries</li>
47 <li><a href="sqlpp/manual.html" title="The SQL++ Query Language"><span class="none"></span>The SQL++ Query Language</a></li>
48 <li><a href="sqlpp/builtins.html" title="Builtin Functions"><span class="none"></span>Builtin Functions</a></li>
49 <li class="nav-header">API/SDK</li>
50 <li><a href="api.html" title="HTTP API"><span class="none"></span>HTTP API</a></li>
51 <li><a href="csv.html" title="CSV Output"><span class="none"></span>CSV Output</a></li>
52 <li class="nav-header">Advanced Features</li>
53 <li><a href="aql/externaldata.html" title="Accessing External Data"><span class="none"></span>Accessing External Data</a></li>
54 <li><a href="feeds.html" title="Data Ingestion with Feeds"><span class="none"></span>Data Ingestion with Feeds</a></li>
55 <li class="active"><a href="#"><span class="none"></span>User Defined Functions</a></li>
56 <li><a href="sqlpp/filters.html" title="Filter-Based LSM Index Acceleration"><span class="none"></span>Filter-Based LSM Index Acceleration</a></li>
57 <li><a href="sqlpp/fulltext.html" title="Support of Full-text Queries"><span class="none"></span>Support of Full-text Queries</a></li>
58 <li><a href="sqlpp/similarity.html" title="Support of Similarity Queries"><span class="none"></span>Support of Similarity Queries</a></li>
59 <li class="nav-header">Deprecated</li>
60 <li><a href="aql/primer.html" title="AsterixDB Primer: Using AQL"><span class="none"></span>AsterixDB Primer: Using AQL</a></li>
61 <li><a href="aql/manual.html" title="Queries: The Asterix Query Language (AQL)"><span class="none"></span>Queries: The Asterix Query Language (AQL)</a></li>
62 <li><a href="aql/builtins.html" title="Queries: Builtin Functions (AQL)"><span class="none"></span>Queries: Builtin Functions (AQL)</a></li>
63</ul>
64 <hr />
65 <div id="poweredBy">
66 <div class="clear"></div>
67 <div class="clear"></div>
68 <div class="clear"></div>
69 <div class="clear"></div>
70<a href="./" title="AsterixDB" class="builtBy"><img class="builtBy" alt="AsterixDB" src="images/asterixlogo.png" /></a>
71 </div>
72 </div>
73 </div>
74 <div id="bodyColumn" class="span10" >
75<!--
76 ! Licensed to the Apache Software Foundation (ASF) under one
77 ! or more contributor license agreements. See the NOTICE file
78 ! distributed with this work for additional information
79 ! regarding copyright ownership. The ASF licenses this file
80 ! to you under the Apache License, Version 2.0 (the
81 ! "License"); you may not use this file except in compliance
82 ! with the License. You may obtain a copy of the License at
83 !
84 ! http://www.apache.org/licenses/LICENSE-2.0
85 !
86 ! Unless required by applicable law or agreed to in writing,
87 ! software distributed under the License is distributed on an
88 ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
89 ! KIND, either express or implied. See the License for the
90 ! specific language governing permissions and limitations
91 ! under the License.
92 !-->
93<h1>User-defined Functions</h1>
94<div class="section">
95<h2><a name="Table_of_Contents"></a><a name="atoc" id="#toc">Table of Contents</a></h2>
96<ul>
97
98<li><a href="#introduction">Introduction</a></li>
99<li><a href="#installingUDF">Installing an UDF Library</a></li>
100<li><a href="#UDFOnFeeds">Attaching an UDF on Data Feeds</a></li>
101<li><a href="#udfConfiguration">A quick look of the UDF configuration</a></li>
102<li><a href="#uninstall">Unstalling an UDF Library</a><!--
103! Licensed to the Apache Software Foundation (ASF) under one
104! or more contributor license agreements. See the NOTICE file
105! distributed with this work for additional information
106! regarding copyright ownership. The ASF licenses this file
107! to you under the Apache License, Version 2.0 (the
108! "License"); you may not use this file except in compliance
109! with the License. You may obtain a copy of the License at
110!
111! http://www.apache.org/licenses/LICENSE-2.0
112!
113! Unless required by applicable law or agreed to in writing,
114! software distributed under the License is distributed on an
115! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
116! KIND, either express or implied. See the License for the
117! specific language governing permissions and limitations
118! under the License.
119!--></li>
120</ul></div>
121<div class="section">
122<h2><a name="Introduction"></a><a name="introduction">Introduction</a></h2>
123<p>Apache AsterixDB supports three languages for writing user-defined functions (UDFs): SQL++, Java and Python A user can encapsulate data processing logic into a UDF and invoke it later repeatedly. For SQL++ functions, a user can refer to <a href="sqlpp/manual.html#Functions">SQL++ Functions</a> for their usages. This document will focus on UDFs in languages other than SQL++</p></div>
124<div class="section">
125<h2><a name="Endpoints_and_Authentication"></a><a name="authentication">Endpoints and Authentication</a></h2>
126<p>The UDF endpoint is not enabled by default until authentication has been configured properly. To enable it, we will need to set the path to the credential file and populate it with our username and password.</p>
127<p>The credential file is a simple <tt>/etc/passwd</tt> style text file with usernames and corresponding <tt>bcrypt</tt> hashed and salted passwords. You can populate this on your own if you would like, but the <tt>asterixhelper</tt> utility can write the entries as well. We can invoke <tt>asterixhelper</tt> like so:</p>
128
129<div>
130<div>
131<pre class="source">$ bin/asterixhelper -u admin -p admin -cp opt/local/conf add_credential
132</pre></div></div>
133
134<p>Then, in your <tt>cc.conf</tt>, in the <tt>[cc]</tt> section, add the correct <tt>credential.file</tt> path</p>
135
136<div>
137<div>
138<pre class="source">[cc]
139address = 127.0.0.1
140...
141...
142credential.file = conf/passwd
143</pre></div></div>
144
145<p>Now,restart the cluster if it was already started to allow the Cluster Controller to find the new credentials.</p></div>
146<div class="section">
147<h2><a name="Installing_a_Java_UDF_Library"></a><a name="installingUDF">Installing a Java UDF Library</a></h2>
148<p>To install a UDF package to the cluster, we need to send a Multipart Form-data HTTP request to the <tt>/admin/udf</tt> endpoint of the CC at the normal API port (<tt>19002</tt> by default). The request should use HTTP Basic authentication. This means your credentials will <i>not</i> be obfuscated or encrypted <i>in any way</i>, so submit to this endpoint over localhost or a network where you know your traffic is safe from eavesdropping. Any suitable tool will do, but for the example here I will use <tt>curl</tt> which is widely available.</p>
149<p>For example, to install a library with the following criteria:</p>
150<ul>
151
152<li><tt>udfs</tt> dataverse name</li>
153<li>with a new Library name of <tt>testlib</tt></li>
154<li>from <tt>lib.zip</tt> in the present working directory</li>
155<li>to the cluster at <tt>localhost</tt> with API port <tt>19002</tt></li>
156<li>with credentials being a username and password of <tt>admin:admin</tt></li>
157</ul>
158<p>we would execute</p>
159
160<div>
161<div>
162<pre class="source">curl -v -u admin:admin -X POST -F 'data=@./lib.zip' localhost:19002/admin/udf/udfs/testlib
163</pre></div></div>
164
165<p>Any response other than <tt>200</tt> indicates an error in deployment.</p>
166<p>In the AsterixDB source release, we provide several sample UDFs that you can try out. You need to build the AsterixDB source to get the compiled UDF package. It can be found under the <tt>asterixdb-external</tt> sub-project. Assuming that these UDFs have been installed into the <tt>udfs</tt> dataverse and <tt>testlib</tt> library, here is an example that uses the sample UDF <tt>mysum</tt> to compute the sum of two input integers.</p>
167
168<div>
169<div>
170<pre class="source">USE udfs;
171
172CREATE FUNCTION mysum(a: int32, b: int32)
173RETURNS int32
174LANGUAGE JAVA
175AS &quot;testlib&quot;,&quot;org.apache.asterix.external.library.MySumFactory&quot;;
176</pre></div></div>
177</div>
178<div class="section">
179<h2><a name="Creating_a_Python_UDF"></a><a name="PythonUDF" id="PythonUDF">Creating a Python UDF</a></h2>
180<p>Python UDFs need to be rolled into a <a class="externalLink" href="https://github.com/linkedin/shiv">shiv</a> package with all their dependencies. By default AsterixDB will use the Python interpreter located at <tt>/usr/bin/python3</tt>. This can be changed in the cluster config <tt>[common]</tt> section using the <tt>python.path</tt> configuration variable.</p>
181<p>First, let&#x2019;s devise a function that we would like to use in AsterixDB, <tt>sentiment_mod.py</tt></p>
182
183<div>
184<div>
185<pre class="source">import os
186from typing import Tuple
187class sent_model:
188
189 def __init__(self):
190 good_words = os.path.join(os.path.dirname(__file__), 'good.txt')
191 with open(good_words) as f:
192 self.whitelist = f.read().splitlines()
193
194 def sentiment(self, arg: Tuple[str])-&gt; str:
195 words = arg[0].split()
196 for word in words:
197 if word in self.whitelist:
198 return 'great'
199
200 return 'eh'
201</pre></div></div>
202
203<p>Furthermore, let&#x2019;s assume &#x2018;good.txt&#x2019; contains the following entries</p>
204
205<div>
206<div>
207<pre class="source">spam
208eggs
209ham
210</pre></div></div>
211
212<p>Now, in the module directory, execute <tt>shiv</tt> with all the dependencies of the module listed. We don&#x2019;t actually use scikit-learn here (our method is obviously better!), but it&#x2019;s just included as an example of a real dependency.</p>
213
214<div>
215<div>
216<pre class="source">shiv -o lib.pyz --site-packages . scikit-learn
217</pre></div></div>
218
219<p>Then, deploy it the same as the Java UDF was, with the library name <tt>pylib</tt></p>
220
221<div>
222<div>
223<pre class="source">curl -v -u admin:admin -X POST -F 'data=@./lib.pyz' localhost:19002/admin/udf/udfs/pylib
224</pre></div></div>
225
226<p>With the library deployed, we can define a function within it for use. For example, to expose the Python function <tt>sentiment</tt> in the module <tt>sentiment_mod</tt> in the class <tt>sent_model</tt>, the <tt>CREATE FUNCTION</tt> would be as follows</p>
227
228<div>
229<div>
230<pre class="source">USE udfs;
231
232CREATE FUNCTION sentiment(a)
233LANGUAGE PYTHON DETERMINISTIC
234AS &quot;pylib&quot;,&quot;sentiment_mod:sent_model&quot;;
235</pre></div></div>
236
237<p>By default, AsterixDB will treat all external functions as <tt>NOT DETERMINISTIC</tt>. Loosely this means the result might change depending on when the function is called, regardless of the input. This function behaves the same on each input, so we can safely call it <tt>DETERMINISTIC</tt>. This will enable better optimization of queries including this function.</p>
238<p>With the function now defined, it can then be used as any other scalar SQL++ function would be. For example:</p>
239
240<div>
241<div>
242<pre class="source">USE udfs;
243
244INSERT INTO Tweets([
245{&quot;id&quot;:1, &quot;msg&quot;:&quot;spam is great&quot;},
246{&quot;id&quot;:2, &quot;msg&quot;:&quot;i will not eat green eggs and ham&quot;},
247{&quot;id&quot;:3, &quot;msg&quot;:&quot;bacon is better&quot;}]);
248
249USE udfs;
250SELECT t.msg as msg, sentiment(t.msg) as sentiment
251FROM Tweets t;
252</pre></div></div>
253</div>
254<div class="section">
255<h2><a name="Attaching_a_UDF_on_Data_Feeds"></a><a name="UDFOnFeeds" id="UDFOnFeeds">Attaching a UDF on Data Feeds</a></h2>
256<p>In <a href="feeds.html">Data Ingestion using feeds</a>, we introduced an efficient way for users to get data into AsterixDB. In some use cases, users may want to pre-process the incoming data before storing it into the dataset. To meet this need, AsterixDB allows the user to attach a UDF onto the ingestion pipeline. Following the example in <a href="feeds.html">Data Ingestion</a>, here we show an example of how to attach a UDF that extracts the user names mentioned from the incoming Tweet text, storing the processed Tweets into a dataset.</p>
257<p>We start by creating the datatype and dataset that will be used for the feed and UDF. One thing to keep in mind is that data flows from the feed to the UDF and then to the dataset. This means that the feed&#x2019;s datatype should be the same as the input type of the UDF, and the output datatype of the UDF should be the same as the dataset&#x2019;s datatype. Thus, users should make sure that their datatypes are consistent in the UDF configuration. Users can also take advantage of open datatypes in AsterixDB by creating a minimum description of the data for simplicity. Here we use open datatypes:</p>
258
259<div>
260<div>
261<pre class="source">USE udfs;
262
263CREATE TYPE TweetType IF NOT EXISTS AS OPEN {
264 id: int64
265};
266
267CREATE DATASET ProcessedTweets(TweetType) PRIMARY KEY id;
268</pre></div></div>
269
270<p>As the <tt>TweetType</tt> is an open datatype, processed Tweets can be stored into the dataset after they are annotated with an extra attribute. Given the datatype and dataset above, we can create a Twitter Feed with the same datatype. Please refer to section <a href="feeds.html">Data Ingestion</a> if you have any trouble in creating feeds.</p>
271
272<div>
273<div>
274<pre class="source">USE udfs;
275
276CREATE FEED TwitterFeed WITH {
277 &quot;adapter-name&quot;: &quot;push_twitter&quot;,
278 &quot;type-name&quot;: &quot;TweetType&quot;,
279 &quot;format&quot;: &quot;twitter-status&quot;,
280 &quot;consumer.key&quot;: &quot;************&quot;,
281 &quot;consumer.secret&quot;: &quot;************&quot;,
282 &quot;access.token&quot;: &quot;**********&quot;,
283 &quot;access.token.secret&quot;: &quot;*************&quot;
284};
285</pre></div></div>
286
287<p>Then we define the function we want to apply to the feed</p>
288<p>USE udfs;</p>
289<p>CREATE FUNCTION addMentionedUsers(t: TweetType) RETURNS TweetType LANGUAGE JAVA as &#x201c;testlib&#x201d;,&#x201c;org.apache.asterix.external.library.AddMentionedUsersFactory&#x201d; WITH {&#x201c;textFieldName&#x201d;: &#x201c;text&#x201d;};</p>
290<p>After creating the feed, we attach the UDF onto the feed pipeline and start the feed with following statements:</p>
291
292<div>
293<div>
294<pre class="source">USE udfs;
295
296CONNECT FEED TwitterFeed TO DATASET ProcessedTweets APPLY FUNCTION addMentionedUsers;
297
298START FEED TwitterFeed;
299</pre></div></div>
300
301<p>You can check the annotated Tweets by querying the <tt>ProcessedTweets</tt> dataset:</p>
302
303<div>
304<div>
305<pre class="source">SELECT * FROM ProcessedTweets LIMIT 10;
306</pre></div></div>
307</div>
308<div class="section">
309<h2><a name="Unstalling_an_UDF_Library"></a><a name="uninstall">Unstalling an UDF Library</a></h2>
310<p>If you want to uninstall the UDF library, simply issue a <tt>DELETE</tt> against the endpoint you <tt>POST</tt>ed against once all functions declared with the library are removed. First we&#x2019;ll drop the function we declared earlier:</p>
311
312<div>
313<div>
314<pre class="source">USE udfs;
315DROP FUNCTION mysum@2;
316</pre></div></div>
317
318<p>Then issue the proper <tt>DELETE</tt> request</p>
319
320<div>
321<div>
322<pre class="source">curl -u admin:admin -X DELETE localhost:19002/admin/udf/udfs/testlib
323</pre></div></div>
324
325<p>The library will also be dropped if you drop the dataverse entirely.</p></div>
326 </div>
327 </div>
328 </div>
329 <hr/>
330 <footer>
331 <div class="container-fluid">
332 <div class="row-fluid">
333<div class="row-fluid">Apache AsterixDB, AsterixDB, Apache, the Apache
334 feather logo, and the Apache AsterixDB project logo are either
335 registered trademarks or trademarks of The Apache Software
336 Foundation in the United States and other countries.
337 All other marks mentioned may be trademarks or registered
338 trademarks of their respective owners.
339 </div>
340 </div>
341 </div>
342 </footer>
343 </body>
344</html>