<!DOCTYPE html>
2<!--
3 | Generated by Apache Maven Doxia at 2018-02-09
4 | Rendered using Apache Maven Fluido Skin 1.3.0
5-->
6<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
7 <head>
8 <meta charset="UTF-8" />
9 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
10 <meta name="Date-Revision-yyyymmdd" content="20180209" />
11 <meta http-equiv="Content-Language" content="en" />
12 <title>AsterixDB &#x2013; Support for User Defined Functions in AsterixDB</title>
13 <link rel="stylesheet" href="./css/apache-maven-fluido-1.3.0.min.css" />
14 <link rel="stylesheet" href="./css/site.css" />
15 <link rel="stylesheet" href="./css/print.css" media="print" />
16
17
18 <script type="text/javascript" src="./js/apache-maven-fluido-1.3.0.min.js"></script>
19
20
21
22<script>(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
23 (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
24 m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
25 })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
26
27 ga('create', 'UA-41536543-1', 'uci.edu');
28 ga('send', 'pageview');</script>
29
30 </head>
31 <body class="topBarDisabled">
32
33
34
35
36 <div class="container-fluid">
37 <div id="banner">
38 <div class="pull-left">
39 <a href="./" id="bannerLeft">
40 <img src="images/asterixlogo.png" alt="AsterixDB"/>
41 </a>
42 </div>
43 <div class="pull-right"> </div>
44 <div class="clear"><hr/></div>
45 </div>
46
47 <div id="breadcrumbs">
48 <ul class="breadcrumb">
49
50
51 <li id="publishDate">Last Published: 2018-02-09</li>
52
53
54
55 <li id="projectVersion" class="pull-right">Version: 0.9.3</li>
56
57 <li class="divider pull-right">|</li>
58
59 <li class="pull-right"> <a href="index.html" title="Documentation Home">
60 Documentation Home</a>
61 </li>
62
63 </ul>
64 </div>
65
66
67 <div class="row-fluid">
68 <div id="leftColumn" class="span3">
69 <div class="well sidebar-nav">
70
71
72 <ul class="nav nav-list">
73 <li class="nav-header">Get Started - Installation</li>
74
75 <li>
76
77 <a href="ncservice.html" title="Option 1: using NCService">
78 <i class="none"></i>
79 Option 1: using NCService</a>
80 </li>
81
82 <li>
83
84 <a href="ansible.html" title="Option 2: using Ansible">
85 <i class="none"></i>
86 Option 2: using Ansible</a>
87 </li>
88
89 <li>
90
91 <a href="aws.html" title="Option 3: using Amazon Web Services">
92 <i class="none"></i>
93 Option 3: using Amazon Web Services</a>
94 </li>
95
96 <li>
97
98 <a href="yarn.html" title="Option 4: using YARN">
99 <i class="none"></i>
100 Option 4: using YARN</a>
101 </li>
102
103 <li>
104
105 <a href="install.html" title="Option 5: using Managix (deprecated)">
106 <i class="none"></i>
107 Option 5: using Managix (deprecated)</a>
108 </li>
109 <li class="nav-header">AsterixDB Primer</li>
110
111 <li>
112
113 <a href="sqlpp/primer-sqlpp.html" title="Option 1: using SQL++">
114 <i class="none"></i>
115 Option 1: using SQL++</a>
116 </li>
117
118 <li>
119
120 <a href="aql/primer.html" title="Option 2: using AQL">
121 <i class="none"></i>
122 Option 2: using AQL</a>
123 </li>
124 <li class="nav-header">Data Model</li>
125
126 <li>
127
128 <a href="datamodel.html" title="The Asterix Data Model">
129 <i class="none"></i>
130 The Asterix Data Model</a>
131 </li>
132 <li class="nav-header">Queries - SQL++</li>
133
134 <li>
135
136 <a href="sqlpp/manual.html" title="The SQL++ Query Language">
137 <i class="none"></i>
138 The SQL++ Query Language</a>
139 </li>
140
141 <li>
142
143 <a href="sqlpp/builtins.html" title="Builtin Functions">
144 <i class="none"></i>
145 Builtin Functions</a>
146 </li>
147 <li class="nav-header">Queries - AQL</li>
148
149 <li>
150
151 <a href="aql/manual.html" title="The Asterix Query Language (AQL)">
152 <i class="none"></i>
153 The Asterix Query Language (AQL)</a>
154 </li>
155
156 <li>
157
158 <a href="aql/builtins.html" title="Builtin Functions">
159 <i class="none"></i>
160 Builtin Functions</a>
161 </li>
162 <li class="nav-header">API/SDK</li>
163
164 <li>
165
166 <a href="api.html" title="HTTP API">
167 <i class="none"></i>
168 HTTP API</a>
169 </li>
170
171 <li>
172
173 <a href="csv.html" title="CSV Output">
174 <i class="none"></i>
175 CSV Output</a>
176 </li>
177 <li class="nav-header">Advanced Features</li>
178
179 <li>
180
181 <a href="aql/fulltext.html" title="Support of Full-text Queries">
182 <i class="none"></i>
183 Support of Full-text Queries</a>
184 </li>
185
186 <li>
187
188 <a href="aql/externaldata.html" title="Accessing External Data">
189 <i class="none"></i>
190 Accessing External Data</a>
191 </li>
192
193 <li>
194
195 <a href="feeds/tutorial.html" title="Support for Data Ingestion">
196 <i class="none"></i>
197 Support for Data Ingestion</a>
198 </li>
199
200 <li class="active">
201
202 <a href="#"><i class="none"></i>User Defined Functions</a>
203 </li>
204
205 <li>
206
207 <a href="aql/filters.html" title="Filter-Based LSM Index Acceleration">
208 <i class="none"></i>
209 Filter-Based LSM Index Acceleration</a>
210 </li>
211
212 <li>
213
214 <a href="aql/similarity.html" title="Support of Similarity Queries">
215 <i class="none"></i>
216 Support of Similarity Queries</a>
217 </li>
218 </ul>
219
220
221
222 <hr class="divider" />
223
224 <div id="poweredBy">
225 <div class="clear"></div>
226 <div class="clear"></div>
227 <div class="clear"></div>
228 <a href="./" title="AsterixDB" class="builtBy">
229 <img class="builtBy" alt="AsterixDB" src="images/asterixlogo.png" />
230 </a>
231 </div>
232 </div>
233 </div>
234
235
236 <div id="bodyColumn" class="span9" >
237
238 <!-- ! Licensed to the Apache Software Foundation (ASF) under one
239 ! or more contributor license agreements. See the NOTICE file
240 ! distributed with this work for additional information
241 ! regarding copyright ownership. The ASF licenses this file
242 ! to you under the Apache License, Version 2.0 (the
243 ! "License"); you may not use this file except in compliance
244 ! with the License. You may obtain a copy of the License at
245 !
246 ! http://www.apache.org/licenses/LICENSE-2.0
247 !
248 ! Unless required by applicable law or agreed to in writing,
249 ! software distributed under the License is distributed on an
250 ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
251 ! KIND, either express or implied. See the License for the
252 ! specific language governing permissions and limitations
253 ! under the License.
254 ! --><h1>Support for User Defined Functions in AsterixDB</h1>
<div class="section">
<h2><a name="Table_of_Contents"></a><a name="atoc" id="#toc">Table of Contents</a></h2>

<ul>

<li><a href="#PreprocessingCollectedData">Using UDFs to preprocess feed-collected data</a></li>

<li><a href="#WritingAnExternalUDF">Writing an External UDF</a></li>

<li><a href="#CreatingAnAsterixDBLibrary">Creating an AsterixDB Library</a></li>

<li><a href="#installingUDF">Installing an AsterixDB Library</a></li>
</ul>
<p>In this document, we describe the support for implementing, using, and installing user-defined functions (UDFs) in AsterixDB. We explain how UDFs can be used to preprocess data, for example data collected using feeds (see the <a href="feeds/tutorial.html">feeds tutorial</a>).</p>
<div class="section">
<h3><a name="Installing_an_AsterixDB_Library"></a><a name="installingUDF">Installing an AsterixDB Library</a></h3>
<p>We assume you have followed the <a href="../install.html">installation instructions</a> to set up a running AsterixDB instance. Let us refer to your AsterixDB instance by the name &#x201c;my_asterix&#x201d;.</p>

<ul>

<li>
<p>Step 1: Stop the AsterixDB instance if it is in the ACTIVE state.</p>

<div class="source">
<div class="source">
<pre>$ managix stop -n my_asterix
</pre></div></div></li>

<li>
<p>Step 2: Install the library using the Managix install command. To illustrate, we first use the help command to look up its syntax.</p>

<div class="source">
<div class="source">
<pre>$ managix help -cmd install
Installs a library to an asterix instance.
Options
n Name of Asterix Instance
d Name of the dataverse under which the library will be installed
l Name of the library
p Path to library zip bundle
</pre></div></div></li>
</ul>
<p>The sample output above explains the usage and the required parameters. Each library has a name and is installed under a dataverse. Recall that we created a dataverse named &#x201c;feeds&#x201d; prior to creating our datatypes and dataset. We shall name our library &#x201c;testlib&#x201d;.</p>
<p>We assume you have a library zip bundle that needs to be installed. To install the library, use the Managix install command. An example is shown below.</p>

<div class="source">
<div class="source">
<pre>$ managix install -n my_asterix -d feeds -l testlib -p extlibs/asterix-external-data-0.8.7-binary-assembly.zip
</pre></div></div>
<p>You should see the following message:</p>

<div class="source">
<div class="source">
<pre>INFO: Installed library testlib
</pre></div></div>
<p>Next, we start our AsterixDB instance using the start command, as shown below.</p>

<div class="source">
<div class="source">
<pre>$ managix start -n my_asterix
</pre></div></div>
<p>You may now use the AsterixDB library in AQL statements and queries. To look at the installed artifacts, you may execute the following queries in the AsterixDB web console.</p>

<div class="source">
<div class="source">
<pre>for $x in dataset Metadata.Function
return $x

for $x in dataset Metadata.Library
return $x
</pre></div></div>
<p>Our library is now installed and is ready to be used.</p></div></div>
<div class="section">
<h2><a name="Preprocessing_Collected_Data"></a><a name="PreprocessingCollectedData" id="PreprocessingCollectedData">Preprocessing Collected Data</a></h2>
<p>In the following, we assume that you have already created the <tt>TwitterFeed</tt> and its corresponding data types and dataset by following the instructions in the <a href="feeds/tutorial.html">feeds tutorial</a>.</p>
<p>A feed definition may optionally include the specification of a user-defined function that is to be applied to each feed object prior to persistence. Examples of pre-processing include adding attributes, filtering out objects, sampling, sentiment analysis, and feature extraction. We can express a UDF, which can be defined in AQL or in a programming language such as Java, to perform such pre-processing. An AQL UDF is a good fit when pre-processing an object requires the result of a query (join or aggregate) over data contained in AsterixDB datasets. More sophisticated processing, such as sentiment analysis of text, is better handled by providing a Java UDF. A Java UDF has an initialization phase that allows the UDF to access any resources it may need to initialize itself prior to being used in a data flow. It is assumed by the AsterixDB compiler to be stateless and thus usable as an embarrassingly parallel black box. In contrast, the AsterixDB compiler can reason about an AQL UDF and involve the use of indexes during its invocation.</p>
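<p>The initialize-then-evaluate life cycle described above can be pictured with a small standalone sketch. The interface <tt>SketchScalarFunction</tt> and the class <tt>ToySentimentFunction</tt> are hypothetical names invented for this illustration; they are not the AsterixDB external function API, and the word list is a toy stand-in for a real sentiment model.</p>

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical contract for illustration only; NOT the AsterixDB API.
interface SketchScalarFunction {
    void initialize();              // one-time setup before the data flow starts
    String evaluate(String input);  // called once per object, no per-object state kept
    void deinitialize();            // release any resources acquired in initialize()
}

// Toy sentiment tagger: loads a word list once, then treats each input independently.
class ToySentimentFunction implements SketchScalarFunction {
    private Set<String> positiveWords;

    @Override
    public void initialize() {
        // A real UDF might load a dictionary or model file here (assumption).
        positiveWords = new HashSet<>();
        positiveWords.add("good");
        positiveWords.add("great");
        positiveWords.add("love");
    }

    @Override
    public String evaluate(String input) {
        // The result depends only on the input and the read-only word list,
        // so calls can run in parallel on different objects.
        for (String token : input.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (positiveWords.contains(token)) {
                return "positive";
            }
        }
        return "neutral";
    }

    @Override
    public void deinitialize() {
        positiveWords = null;
    }
}

class LifecycleDemo {
    public static void main(String[] args) {
        ToySentimentFunction f = new ToySentimentFunction();
        f.initialize();
        System.out.println(f.evaluate("I love AsterixDB"));
        f.deinitialize();
    }
}
```

<p>Because <tt>evaluate</tt> reads only its argument and data prepared during initialization, the function behaves like the stateless black box that the paragraph above describes.</p>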
<p>We consider an example transformation of a raw tweet into its lightweight version called <tt>ProcessedTweet</tt>, which is defined next.</p>

<div class="source">
<div class="source">
<pre>use dataverse feeds;

create type ProcessedTweet if not exists as open {
    id: string,
    user_name: string,
    location: point,
    created_at: string,
    message_text: string,
    country: string,
    topics: {{string}}
};

create dataset ProcessedTweets(ProcessedTweet)
primary key id;
</pre></div></div>
<p>Transforming a collected tweet into its lighter version of type <tt>ProcessedTweet</tt> involves extracting the topics or hashtags (if any) in the tweet and collecting them in the &#x201c;topics&#x201d; attribute. Additionally, the latitude and longitude values (doubles) are combined into the spatial point type. Spatial data types are first-class citizens in AsterixDB and support index creation. Next we show a revised version of our example TwitterFeed that involves the use of a UDF. We assume that the UDF containing the transformation logic into a &#x201c;ProcessedTweet&#x201d; is available as a Java UDF inside an AsterixDB library named &#x2018;testlib&#x2019;. We defer the writing of a Java UDF and its installation as part of an AsterixDB library to a later section of this document.</p>

<div class="source">
<div class="source">
<pre>use dataverse feeds;

create feed ProcessedTwitterFeed if not exists
using &quot;push_twitter&quot;
((&quot;type-name&quot;=&quot;Tweet&quot;),
(&quot;consumer.key&quot;=&quot;************&quot;),
(&quot;consumer.secret&quot;=&quot;**************&quot;),
(&quot;access.token&quot;=&quot;**********&quot;),
(&quot;access.token.secret&quot;=&quot;*************&quot;))

apply function testlib#addHashTagsInPlace;
</pre></div></div>
<p>Note that a feed adaptor and a UDF act as pluggable components. Together they provide a generic &#x201c;plug-and-play&#x201d; model in which custom implementations can be supplied to meet specific requirements.</p>
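<p>The transformation logic described above (collecting hashtags into a topics list and combining latitude and longitude into a point) might look roughly like the following plain-Java sketch. It is a standalone illustration, not the source of the <tt>testlib</tt> functions: the class <tt>TweetTransformSketch</tt>, its method names, the hashtag pattern, and the order of the point coordinates are all assumptions made for this example.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone illustration; names are hypothetical and independent of AsterixDB.
class TweetTransformSketch {
    // Assumption: a hashtag is '#' followed by letters, digits, or underscores.
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    // Collects the hashtags found in the text, without the leading '#'.
    static List<String> extractTopics(String messageText) {
        List<String> topics = new ArrayList<>();
        Matcher matcher = HASHTAG.matcher(messageText);
        while (matcher.find()) {
            topics.add(matcher.group(1));
        }
        return topics;
    }

    // Combines two doubles into a textual point; the coordinate order is an assumption.
    static String toPoint(double latitude, double longitude) {
        return "point(\"" + latitude + "," + longitude + "\")";
    }

    public static void main(String[] args) {
        System.out.println(extractTopics("Loving #AsterixDB and #BigData!"));
        System.out.println(toPoint(30.0, 70.0));
    }
}
```

<p>A tweet without hashtags yields an empty list, which matches the &#x201c;if any&#x201d; wording above.</p>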
<div class="section">
<div class="section">
<h4><a name="Building_a_Cascade_Network_of_Feeds"></a>Building a Cascade Network of Feeds</h4>
<p>Multiple high-level applications may wish to consume the data ingested from a data feed. Each such application might perceive the feed in a different way and require the arriving data to be processed and/or persisted differently. Building a separate flow of data from the external source for each application wastes resources, because the pre-processing or transformations required by each application might overlap and could be done together in an incremental fashion to avoid redundancy. A single flow of data from the external source could provide data for multiple applications. To achieve this, we introduce the notion of primary and secondary feeds in AsterixDB.</p>
<p>A feed in AsterixDB is considered to be a primary feed if it gets its data from an external data source. The objects contained in a feed (subsequent to any pre-processing) are directed to a designated AsterixDB dataset. Alternatively or additionally, these objects can be used to derive other feeds known as secondary feeds. A secondary feed is similar to its parent feed in every other aspect; it can have an associated UDF to allow for any subsequent processing, can be persisted into a dataset, and/or can be made to derive other secondary feeds to form a cascade network. A primary feed and a dependent secondary feed form a hierarchy. As an example, we next show an AQL statement that redefines the previous feed &#x201c;ProcessedTwitterFeed&#x201d; in terms of its parent feed (TwitterFeed).</p>

<div class="source">
<div class="source">
<pre>use dataverse feeds;

drop feed ProcessedTwitterFeed if exists;

create secondary feed ProcessedTwitterFeed from feed TwitterFeed
apply function testlib#addHashTags;

connect feed ProcessedTwitterFeed to dataset ProcessedTweets;
</pre></div></div>
<p>The <tt>addHashTags</tt> function is already provided in the example UDF. To see what objects are being inserted into the dataset, we can perform a simple dataset scan after allowing a few moments for the feed to start ingesting data:</p>

<div class="source">
<div class="source">
<pre>use dataverse feeds;

for $i in dataset ProcessedTweets limit 10 return $i;
</pre></div></div>
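<p>The primary/secondary relationship in this section can also be pictured as a small in-memory model, in which each feed applies its optional function once and hands the result to the feeds derived from it. This is only a conceptual sketch: <tt>FeedSketch</tt> and its methods are hypothetical names, every feed here stores its output in a list (whereas persisting a real feed is optional), and nothing below reflects how AsterixDB implements feeds internally.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Toy model of a feed cascade; names are hypothetical, not AsterixDB classes.
class FeedSketch {
    private final String name;
    private final UnaryOperator<String> function;            // optional pre-processing step
    private final List<FeedSketch> secondaries = new ArrayList<>();
    private final List<String> dataset = new ArrayList<>();  // stands in for a connected dataset

    FeedSketch(String name, UnaryOperator<String> function) {
        this.name = name;
        this.function = function;
    }

    // Derives a secondary feed that consumes this feed's processed output.
    FeedSketch secondary(String childName, UnaryOperator<String> childFunction) {
        FeedSketch child = new FeedSketch(childName, childFunction);
        secondaries.add(child);
        return child;
    }

    // Applies this feed's function once, stores the result, and passes it down the cascade.
    void ingest(String object) {
        String processed = (function == null) ? object : function.apply(object);
        dataset.add(processed);
        for (FeedSketch child : secondaries) {
            child.ingest(processed);
        }
    }

    String name() {
        return name;
    }

    List<String> dataset() {
        return dataset;
    }

    public static void main(String[] args) {
        FeedSketch primary = new FeedSketch("TwitterFeed", null);
        FeedSketch processed = primary.secondary("ProcessedTwitterFeed", String::trim);
        primary.ingest("  hello #world  ");
        System.out.println(processed.name() + " holds " + processed.dataset());
    }
}
```

<p>The object enters the model once through the primary feed, so work done by a parent is shared by every feed derived from it instead of being repeated per application.</p>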
<p>For an example of how to write a Java UDF from scratch, the source for the example UDF used in this tutorial is available <a class="externalLink" href="https://github.com/apache/asterixdb/tree/master/asterixdb/asterix-external-data/src/test/java/org/apache/asterix/external/library">here</a>.</p></div></div></div>
<div class="section">
<h2><a name="Unstalling_an_AsterixDB_Library"></a><a name="installingUDF">Uninstalling an AsterixDB Library</a></h2>
<p>To uninstall a library, stop the AsterixDB instance and then use the Managix uninstall command as follows:</p>

<div class="source">
<div class="source">
<pre>$ managix stop -n my_asterix

$ managix uninstall -n my_asterix -d feeds -l testlib
</pre></div></div></div>
403 </div>
404 </div>
405 </div>
406
407 <hr/>
408
409 <footer>
410 <div class="container-fluid">
411 <div class="row span12">Copyright &copy; 2018
412 <a href="https://www.apache.org/">The Apache Software Foundation</a>.
413 All Rights Reserved.
414
415 </div>
416
418<div class="row-fluid">Apache AsterixDB, AsterixDB, Apache, the Apache
419 feather logo, and the Apache AsterixDB project logo are either
420 registered trademarks or trademarks of The Apache Software
421 Foundation in the United States and other countries.
422 All other marks mentioned may be trademarks or registered
423 trademarks of their respective owners.</div>
424
425
426 </div>
427 </footer>
428 </body>
429</html>