<!DOCTYPE html>
<!--
 | Generated by Apache Maven Doxia Site Renderer 1.8.1 from src/site/markdown/udf.md at 2019-03-07
 | Rendered using Apache Maven Fluido Skin 1.7
-->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta name="Date-Revision-yyyymmdd" content="20190307" />
    <meta http-equiv="Content-Language" content="en" />
    <title>AsterixDB &#x2013; Support for User Defined Functions in AsterixDB</title>
    <link rel="stylesheet" href="./css/apache-maven-fluido-1.7.min.css" />
    <link rel="stylesheet" href="./css/site.css" />
    <link rel="stylesheet" href="./css/print.css" media="print" />
    <script type="text/javascript" src="./js/apache-maven-fluido-1.7.min.js"></script>

  </head>
  <body class="topBarDisabled">
    <div class="container-fluid">
      <div id="banner">
        <div class="pull-left"><a href="./" id="bannerLeft"><img src="images/asterixlogo.png" alt="AsterixDB"/></a></div>
        <div class="pull-right"></div>
        <div class="clear"><hr/></div>
      </div>

      <div id="breadcrumbs">
        <ul class="breadcrumb">
          <li id="publishDate">Last Published: 2019-03-07</li>
          <li id="projectVersion" class="pull-right">Version: 0.9.4</li>
          <li class="pull-right"><a href="index.html" title="Documentation Home">Documentation Home</a></li>
        </ul>
      </div>
      <div class="row-fluid">
        <div id="leftColumn" class="span2">
          <div class="well sidebar-nav">
<ul class="nav nav-list">
  <li class="nav-header">Get Started - Installation</li>
  <li><a href="ncservice.html" title="Option 1: using NCService"><span class="none"></span>Option 1: using NCService</a></li>
  <li><a href="ansible.html" title="Option 2: using Ansible"><span class="none"></span>Option 2: using Ansible</a></li>
  <li><a href="aws.html" title="Option 3: using Amazon Web Services"><span class="none"></span>Option 3: using Amazon Web Services</a></li>
  <li class="nav-header">AsterixDB Primer</li>
  <li><a href="sqlpp/primer-sqlpp.html" title="Option 1: using SQL++"><span class="none"></span>Option 1: using SQL++</a></li>
  <li><a href="aql/primer.html" title="Option 2: using AQL"><span class="none"></span>Option 2: using AQL</a></li>
  <li class="nav-header">Data Model</li>
  <li><a href="datamodel.html" title="The Asterix Data Model"><span class="none"></span>The Asterix Data Model</a></li>
  <li class="nav-header">Queries - SQL++</li>
  <li><a href="sqlpp/manual.html" title="The SQL++ Query Language"><span class="none"></span>The SQL++ Query Language</a></li>
  <li><a href="sqlpp/builtins.html" title="Builtin Functions"><span class="none"></span>Builtin Functions</a></li>
  <li class="nav-header">Queries - AQL</li>
  <li><a href="aql/manual.html" title="The Asterix Query Language (AQL)"><span class="none"></span>The Asterix Query Language (AQL)</a></li>
  <li><a href="aql/builtins.html" title="Builtin Functions"><span class="none"></span>Builtin Functions</a></li>
  <li class="nav-header">API/SDK</li>
  <li><a href="api.html" title="HTTP API"><span class="none"></span>HTTP API</a></li>
  <li><a href="csv.html" title="CSV Output"><span class="none"></span>CSV Output</a></li>
  <li class="nav-header">Advanced Features</li>
  <li><a href="aql/fulltext.html" title="Support of Full-text Queries"><span class="none"></span>Support of Full-text Queries</a></li>
  <li><a href="aql/externaldata.html" title="Accessing External Data"><span class="none"></span>Accessing External Data</a></li>
  <li><a href="feeds/tutorial.html" title="Support for Data Ingestion"><span class="none"></span>Support for Data Ingestion</a></li>
  <li class="active"><a href="#"><span class="none"></span>User Defined Functions</a></li>
  <li><a href="aql/filters.html" title="Filter-Based LSM Index Acceleration"><span class="none"></span>Filter-Based LSM Index Acceleration</a></li>
  <li><a href="aql/similarity.html" title="Support of Similarity Queries"><span class="none"></span>Support of Similarity Queries</a></li>
</ul>
            <hr />
            <div id="poweredBy">
              <div class="clear"></div>
              <div class="clear"></div>
              <div class="clear"></div>
              <div class="clear"></div>
<a href="./" title="AsterixDB" class="builtBy"><img class="builtBy" alt="AsterixDB" src="images/asterixlogo.png" /></a>
            </div>
          </div>
        </div>
        <div id="bodyColumn" class="span10" >
<!--
 ! Licensed to the Apache Software Foundation (ASF) under one
 ! or more contributor license agreements. See the NOTICE file
 ! distributed with this work for additional information
 ! regarding copyright ownership. The ASF licenses this file
 ! to you under the Apache License, Version 2.0 (the
 ! "License"); you may not use this file except in compliance
 ! with the License. You may obtain a copy of the License at
 !
 ! http://www.apache.org/licenses/LICENSE-2.0
 !
 ! Unless required by applicable law or agreed to in writing,
 ! software distributed under the License is distributed on an
 ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 ! KIND, either express or implied. See the License for the
 ! specific language governing permissions and limitations
 ! under the License.
 !-->
<h1>Support for User Defined Functions in AsterixDB</h1>
<div class="section">
<h2><a name="Table_of_Contents"></a><a name="atoc" id="#toc">Table of Contents</a></h2>
<ul>

<li><a href="#PreprocessingCollectedData">Using UDFs to preprocess feed-collected data</a></li>
<li><a href="#WritingAnExternalUDF">Writing an External UDF</a></li>
<li><a href="#CreatingAnAsterixDBLibrary">Creating an AsterixDB Library</a></li>
<li><a href="#installingUDF">Installing an AsterixDB Library</a></li>
</ul>
<p>This document describes the support for implementing, installing, and using user-defined functions (UDFs) in AsterixDB. As a running example, it shows how a UDF can preprocess data collected using feeds (see the <a href="feeds/tutorial.html">feeds tutorial</a>).</p>
<div class="section">
<h3><a name="Installing_an_AsterixDB_Library"></a><a name="installingUDF">Installing an AsterixDB Library</a></h3>
<p>We assume you have followed the <a href="../install.html">installation instructions</a> to set up a running AsterixDB instance. Let us refer to your AsterixDB instance by the name &#x201c;my_asterix&#x201d;.</p>
<ul>

<li>

<p>Step 1: Stop the AsterixDB instance if it is in the ACTIVE state.</p>

<div>
<div>
<pre class="source">$ managix stop -n my_asterix
</pre></div></div>
</li>
<li>

<p>Step 2: Install the library using the Managix install command. To illustrate, we first use the help command to look up its syntax.</p>

<div>
<div>
<pre class="source">$ managix help -cmd install
Installs a library to an asterix instance.
Options
n Name of Asterix Instance
d Name of the dataverse under which the library will be installed
l Name of the library
p Path to library zip bundle
</pre></div></div>
</li>
</ul>
<p>The sample output above explains the usage and the required parameters. Each library has a name and is installed under a dataverse. Recall that we created a dataverse named &#x201c;feeds&#x201d; prior to creating our datatypes and dataset. We shall name our library &#x201c;testlib&#x201d;.</p>
<p>We assume you have a library zip bundle that needs to be installed. To install the library, use the Managix install command. An example is shown below.</p>

<div>
<div>
<pre class="source"> $ managix install -n my_asterix -d feeds -l testlib -p extlibs/asterix-external-data-0.8.7-binary-assembly.zip
</pre></div></div>

<p>You should see the following message:</p>

<div>
<div>
<pre class="source"> INFO: Installed library testlib
</pre></div></div>

<p>Next, start the AsterixDB instance using the start command as shown below.</p>

<div>
<div>
<pre class="source"> $ managix start -n my_asterix
</pre></div></div>

<p>You may now use the AsterixDB library in AQL statements and queries. To look at the installed artifacts, you may execute the following queries in the AsterixDB web console.</p>

<div>
<div>
<pre class="source"> for $x in dataset Metadata.Function
 return $x;

 for $x in dataset Metadata.Library
 return $x;
</pre></div></div>

<p>Our library is now installed and is ready to be used.</p></div></div>
<div class="section">
<h2><a name="Preprocessing_Collected_Data"></a><a name="PreprocessingCollectedData" id="PreprocessingCollectedData">Preprocessing Collected Data</a></h2>
<p>In the following, we assume that you have already created the <tt>TwitterFeed</tt> and its corresponding data types and dataset by following the instructions in the <a href="feeds/tutorial.html">feeds tutorial</a>.</p>
<p>A feed definition may optionally specify a user-defined function that is applied to each feed object before it is persisted. Examples of such pre-processing include adding attributes, filtering out objects, sampling, sentiment analysis, and feature extraction. The UDF can be defined in AQL or in a programming language such as Java. An AQL UDF is a good fit when pre-processing an object requires the result of a query (join or aggregate) over data contained in AsterixDB datasets. More sophisticated processing, such as sentiment analysis of text, is better handled by a Java UDF. A Java UDF has an initialization phase that allows it to access any resources it needs before being used in a data flow. The AsterixDB compiler assumes a Java UDF to be stateless and thus usable as an embarrassingly parallel black box. In contrast, the compiler can reason about an AQL UDF and may use indexes during its invocation.</p>
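<p>The lifecycle described above (initialize once, then evaluate each object without carrying state between calls) can be sketched in plain Java. This is a minimal illustration only: the <tt>SketchUdf</tt> interface, the <tt>KeywordSentimentUdf</tt> class, and the word lists are hypothetical names invented here, and they are not the AsterixDB external-function API.</p>

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical lifecycle contract for illustration; NOT the AsterixDB API.
interface SketchUdf<I, O> {
    void initialize();     // acquire resources once, before any data flows
    O evaluate(I input);   // called per object; result depends only on the input
    void deinitialize();   // release resources when the data flow ends
}

// Toy "sentiment" UDF: +1 per positive word, -1 per negative word.
final class KeywordSentimentUdf implements SketchUdf<String, Integer> {
    private Set<String> positive = Set.of();
    private Set<String> negative = Set.of();

    @Override
    public void initialize() {
        // A real UDF might load a dictionary or model file here (assumption).
        positive = Set.of("good", "great", "love");
        negative = Set.of("bad", "awful", "hate");
    }

    @Override
    public Integer evaluate(String text) {
        int score = 0;
        // Split on runs of non-word characters; the word sets are read-only,
        // so the same instance can be called for any object in any order.
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (positive.contains(token)) {
                score++;
            } else if (negative.contains(token)) {
                score--;
            }
        }
        return score;
    }

    @Override
    public void deinitialize() {
        positive = Set.of();
        negative = Set.of();
    }
}

final class SketchUdfDemo {
    public static void main(String[] args) {
        KeywordSentimentUdf udf = new KeywordSentimentUdf();
        udf.initialize();
        System.out.println(udf.evaluate("I love this great phone"));   // prints 2
        System.out.println(udf.evaluate("awful battery, bad screen")); // prints -2
        udf.deinitialize();
    }
}
```

<p>Because <tt>evaluate</tt> only reads state that was set up in <tt>initialize</tt>, independent copies of such a function could be run in parallel on different partitions, which is the property the paragraph above attributes to Java UDFs.</p>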
<p>We consider an example transformation of a raw tweet into a lightweight version called <tt>ProcessedTweet</tt>, which is defined next.</p>

<div>
<div>
<pre class="source"> use dataverse feeds;

 create type ProcessedTweet if not exists as open {
     id: string,
     user_name: string,
     location: point,
     created_at: string,
     message_text: string,
     country: string,
     topics: {{string}}
 };

 create dataset ProcessedTweets(ProcessedTweet)
 primary key id;
</pre></div></div>

<p>Transforming a collected tweet into its lighter version of type <tt>ProcessedTweet</tt> involves extracting the topics or hashtags (if any) in the tweet and collecting them in the &#x201c;topics&#x201d; attribute. Additionally, the latitude and longitude values (doubles) are combined into a value of the spatial point type. Note that spatial data types are first-class citizens in AsterixDB and come with support for creating indexes. Next we show a revised version of our example TwitterFeed that uses a UDF. We assume that the UDF containing the transformation logic into a &#x201c;ProcessedTweet&#x201d; is available as a Java UDF inside an AsterixDB library named &#x201c;testlib&#x201d;. Installing such a library is described in the &#x201c;Installing an AsterixDB Library&#x201d; section above; a pointer to the source of the example Java UDF is given at the end of this section.</p>

<div>
<div>
<pre class="source"> use dataverse feeds;

 create feed ProcessedTwitterFeed if not exists
 using &quot;push_twitter&quot;
 ((&quot;type-name&quot;=&quot;Tweet&quot;),
 (&quot;consumer.key&quot;=&quot;************&quot;),
 (&quot;consumer.secret&quot;=&quot;**************&quot;),
 (&quot;access.token&quot;=&quot;**********&quot;),
 (&quot;access.token.secret&quot;=&quot;*************&quot;))

 apply function testlib#addHashTagsInPlace;
</pre></div></div>

<p>Note that a feed adaptor and a UDF act as pluggable components. Together they provide a generic &#x201c;plug-and-play&#x201d; model in which custom implementations can be supplied to meet specific requirements.</p>
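<p>The core of the transformation described above (collecting hashtags into &#x201c;topics&#x201d; and pairing latitude with longitude) can be sketched in plain Java as follows. This is an illustrative sketch, not the source of the <tt>testlib</tt> functions: the class name <tt>HashTagSketch</tt>, its method names, the simplified hashtag pattern, and the coordinate order are all assumptions made here.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any AsterixDB library.
final class HashTagSketch {
    // '#' followed by one or more word characters. This is a simplification
    // of real hashtag rules (assumption).
    private static final Pattern HASH_TAG = Pattern.compile("#(\\w+)");

    // Collects hashtags (without the leading '#') in order of appearance.
    // Duplicates are kept, mirroring the bag-like {{string}} type of "topics".
    static List<String> extractTopics(String messageText) {
        List<String> topics = new ArrayList<>();
        Matcher matcher = HASH_TAG.matcher(messageText);
        while (matcher.find()) {
            topics.add(matcher.group(1));
        }
        return topics;
    }

    // Pairs two doubles into one value standing in for the "location" point.
    // The (latitude, longitude) order used here is an assumption.
    static double[] toPoint(double latitude, double longitude) {
        return new double[] {latitude, longitude};
    }

    public static void main(String[] args) {
        System.out.println(extractTopics("Loving #AsterixDB and #bigdata!"));
        double[] point = toPoint(33.64, -117.84);
        System.out.println(point[0] + "," + point[1]);
    }
}
```

<p>A tweet without hashtags simply yields an empty list, which matches the &#x201c;if any&#x201d; wording in the description.</p>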
<div class="section">
<div class="section">
<h4><a name="Building_a_Cascade_Network_of_Feeds"></a>Building a Cascade Network of Feeds</h4>
<p>Multiple high-level applications may wish to consume the data ingested from a data feed. Each such application might perceive the feed in a different way and require the arriving data to be processed and/or persisted differently. Building a separate flow of data from the external source for each application wastes resources, because the pre-processing or transformations required by the applications might overlap and could be done together, incrementally, to avoid redundancy. A single flow of data from the external source can instead provide data for multiple applications. To achieve this, we introduce the notion of primary and secondary feeds in AsterixDB.</p>
<p>A feed in AsterixDB is considered to be a primary feed if it gets its data from an external data source. The objects contained in a feed (subsequent to any pre-processing) are directed to a designated AsterixDB dataset. Alternatively or additionally, these objects can be used to derive other feeds known as secondary feeds. A secondary feed is similar to its parent feed in every other respect: it can have an associated UDF to allow for any subsequent processing, can be persisted into a dataset, and/or can be used to derive further secondary feeds, forming a cascade network. A primary feed and a dependent secondary feed form a hierarchy. As an example, the following AQL statements redefine the previous feed &#x201c;ProcessedTwitterFeed&#x201d; in terms of its parent feed (TwitterFeed).</p>

<div>
<div>
<pre class="source"> use dataverse feeds;

 drop feed ProcessedTwitterFeed if exists;

 create secondary feed ProcessedTwitterFeed from feed TwitterFeed
 apply function testlib#addHashTags;

 connect feed ProcessedTwitterFeed to dataset ProcessedTweets;
</pre></div></div>

<p>The <tt>addHashTags</tt> function is already provided in the example UDF. To see what objects are being inserted into the dataset, we can perform a simple dataset scan after allowing a few moments for the feed to start ingesting data:</p>

<div>
<div>
<pre class="source"> use dataverse feeds;

 for $i in dataset ProcessedTweets limit 10 return $i;
</pre></div></div>
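<p>To make the primary/secondary relationship concrete, here is a small in-memory model in plain Java. It is only a conceptual sketch of the cascade idea as described in this section: <tt>SketchFeed</tt> and all of its methods are hypothetical, strings stand in for feed objects, and nothing here reflects how AsterixDB implements feeds internally.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical in-memory model of a feed cascade; not AsterixDB code.
final class SketchFeed {
    private final String name;
    private final Function<String, String> udf;               // applied to every object
    private final List<String> dataset = new ArrayList<>();   // stands in for a connected dataset
    private final List<SketchFeed> secondaries = new ArrayList<>();
    private boolean connected;

    SketchFeed(String name, Function<String, String> udf) {
        this.name = name;
        this.udf = udf;
    }

    String name() {
        return name;
    }

    // Derives a secondary feed that receives this feed's processed objects.
    SketchFeed secondary(String childName, Function<String, String> childUdf) {
        SketchFeed child = new SketchFeed(childName, childUdf);
        secondaries.add(child);
        return child;
    }

    // Marks the feed as persisting its objects (like "connect feed ... to dataset").
    SketchFeed connect() {
        connected = true;
        return this;
    }

    // Processes one object, optionally stores it, then passes it down the cascade.
    void accept(String object) {
        String processed = udf.apply(object);
        if (connected) {
            dataset.add(processed);
        }
        for (SketchFeed child : secondaries) {
            child.accept(processed);
        }
    }

    List<String> dataset() {
        return List.copyOf(dataset);
    }

    public static void main(String[] args) {
        // The primary feed does no processing and is persisted as-is.
        SketchFeed primary = new SketchFeed("TwitterFeed", Function.identity()).connect();
        // The secondary feed applies its own function to the parent's output.
        SketchFeed processed = primary
                .secondary("ProcessedTwitterFeed", s -> s.toUpperCase(Locale.ROOT))
                .connect();
        primary.accept("hello #asterixdb");
        System.out.println(primary.name() + " -> " + primary.dataset());
        System.out.println(processed.name() + " -> " + processed.dataset());
    }
}
```

<p>The external source is read once by the primary feed, and each secondary feed reuses that single flow, which is the saving the cascade design is meant to provide.</p>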

<p>For an example of how to write a Java UDF from scratch, the source for the example UDF used in this tutorial is available <a class="externalLink" href="https://github.com/apache/asterixdb/tree/master/asterixdb/asterix-external-data/src/test/java/org/apache/asterix/external/library">here</a>.</p></div></div></div>
<div class="section">
<h2><a name="Unstalling_an_AsterixDB_Library"></a><a name="installingUDF">Uninstalling an AsterixDB Library</a></h2>
<p>To uninstall a library, stop the AsterixDB instance and then use the Managix uninstall command as follows:</p>

<div>
<div>
<pre class="source"> $ managix stop -n my_asterix

 $ managix uninstall -n my_asterix -d feeds -l testlib
</pre></div></div></div>
        </div>
      </div>
    </div>
    <hr/>
    <footer>
      <div class="container-fluid">
        <div class="row-fluid">
<div class="row-fluid">Apache AsterixDB, AsterixDB, Apache, the Apache
 feather logo, and the Apache AsterixDB project logo are either
 registered trademarks or trademarks of The Apache Software
 Foundation in the United States and other countries.
 All other marks mentioned may be trademarks or registered
 trademarks of their respective owners.
</div>
        </div>
      </div>
    </footer>
  </body>
</html>