<!DOCTYPE html>
2<!--
3 | Generated by Apache Maven Doxia at 2017-07-27
4 | Rendered using Apache Maven Fluido Skin 1.3.0
5-->
6<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
7 <head>
8 <meta charset="UTF-8" />
9 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
10 <meta name="Date-Revision-yyyymmdd" content="20170727" />
11 <meta http-equiv="Content-Language" content="en" />
12 <title>AsterixDB &#x2013; Support for Data Ingestion in AsterixDB</title>
13 <link rel="stylesheet" href="../css/apache-maven-fluido-1.3.0.min.css" />
14 <link rel="stylesheet" href="../css/site.css" />
15 <link rel="stylesheet" href="../css/print.css" media="print" />
16
17
18 <script type="text/javascript" src="../js/apache-maven-fluido-1.3.0.min.js"></script>
19
20
21
22<script>(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
23 (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
24 m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
25 })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
26
27 ga('create', 'UA-41536543-1', 'uci.edu');
28 ga('send', 'pageview');</script>
29
30 </head>
31 <body class="topBarDisabled">
32
33
34
35
36 <div class="container-fluid">
37 <div id="banner">
38 <div class="pull-left">
39 <a href=".././" id="bannerLeft">
40 <img src="../images/asterixlogo.png" alt="AsterixDB"/>
41 </a>
42 </div>
43 <div class="pull-right"> </div>
44 <div class="clear"><hr/></div>
45 </div>
46
47 <div id="breadcrumbs">
48 <ul class="breadcrumb">
49
50
51 <li id="publishDate">Last Published: 2017-07-27</li>
52
53
54
55 <li id="projectVersion" class="pull-right">Version: 0.9.2-SNAPSHOT</li>
56
57 <li class="divider pull-right">|</li>
58
59 <li class="pull-right"> <a href="../index.html" title="Documentation Home">
60 Documentation Home</a>
61 </li>
62
63 </ul>
64 </div>
65
66
67 <div class="row-fluid">
68 <div id="leftColumn" class="span3">
69 <div class="well sidebar-nav">
70
71
72 <ul class="nav nav-list">
73 <li class="nav-header">Get Started - Installation</li>
74
75 <li>
76
77 <a href="../ncservice.html" title="Option 1: using NCService">
78 <i class="none"></i>
79 Option 1: using NCService</a>
80 </li>
81
82 <li>
83
84 <a href="../ansible.html" title="Option 2: using Ansible">
85 <i class="none"></i>
86 Option 2: using Ansible</a>
87 </li>
88
89 <li>
90
91 <a href="../aws.html" title="Option 3: using Amazon Web Services">
92 <i class="none"></i>
93 Option 3: using Amazon Web Services</a>
94 </li>
95
96 <li>
97
98 <a href="../yarn.html" title="Option 4: using YARN">
99 <i class="none"></i>
100 Option 4: using YARN</a>
101 </li>
102
103 <li>
104
105 <a href="../install.html" title="Option 5: using Managix (deprecated)">
106 <i class="none"></i>
107 Option 5: using Managix (deprecated)</a>
108 </li>
109 <li class="nav-header">AsterixDB Primer</li>
110
111 <li>
112
113 <a href="../sqlpp/primer-sqlpp.html" title="Option 1: using SQL++">
114 <i class="none"></i>
115 Option 1: using SQL++</a>
116 </li>
117
118 <li>
119
120 <a href="../aql/primer.html" title="Option 2: using AQL">
121 <i class="none"></i>
122 Option 2: using AQL</a>
123 </li>
124 <li class="nav-header">Data Model</li>
125
126 <li>
127
128 <a href="../datamodel.html" title="The Asterix Data Model">
129 <i class="none"></i>
130 The Asterix Data Model</a>
131 </li>
132 <li class="nav-header">Queries - SQL++</li>
133
134 <li>
135
136 <a href="../sqlpp/manual.html" title="The SQL++ Query Language">
137 <i class="none"></i>
138 The SQL++ Query Language</a>
139 </li>
140
141 <li>
142
143 <a href="../sqlpp/builtins.html" title="Builtin Functions">
144 <i class="none"></i>
145 Builtin Functions</a>
146 </li>
147 <li class="nav-header">Queries - AQL</li>
148
149 <li>
150
151 <a href="../aql/manual.html" title="The Asterix Query Language (AQL)">
152 <i class="none"></i>
153 The Asterix Query Language (AQL)</a>
154 </li>
155
156 <li>
157
158 <a href="../aql/builtins.html" title="Builtin Functions">
159 <i class="none"></i>
160 Builtin Functions</a>
161 </li>
162 <li class="nav-header">API/SDK</li>
163
164 <li>
165
166 <a href="../api.html" title="HTTP API">
167 <i class="none"></i>
168 HTTP API</a>
169 </li>
170
171 <li>
172
173 <a href="../csv.html" title="CSV Output">
174 <i class="none"></i>
175 CSV Output</a>
176 </li>
177 <li class="nav-header">Advanced Features</li>
178
179 <li>
180
181 <a href="../aql/fulltext.html" title="Support of Full-text Queries">
182 <i class="none"></i>
183 Support of Full-text Queries</a>
184 </li>
185
186 <li>
187
188 <a href="../aql/externaldata.html" title="Accessing External Data">
189 <i class="none"></i>
190 Accessing External Data</a>
191 </li>
192
193 <li class="active">
194
195 <a href="#"><i class="none"></i>Support for Data Ingestion</a>
196 </li>
197
198 <li>
199
200 <a href="../udf.html" title="User Defined Functions">
201 <i class="none"></i>
202 User Defined Functions</a>
203 </li>
204
205 <li>
206
207 <a href="../aql/filters.html" title="Filter-Based LSM Index Acceleration">
208 <i class="none"></i>
209 Filter-Based LSM Index Acceleration</a>
210 </li>
211
212 <li>
213
214 <a href="../aql/similarity.html" title="Support of Similarity Queries">
215 <i class="none"></i>
216 Support of Similarity Queries</a>
217 </li>
218 </ul>
219
220
221
222 <hr class="divider" />
223
224 <div id="poweredBy">
225 <div class="clear"></div>
226 <div class="clear"></div>
227 <div class="clear"></div>
228 <a href=".././" title="AsterixDB" class="builtBy">
229 <img class="builtBy" alt="AsterixDB" src="../images/asterixlogo.png" />
230 </a>
231 </div>
232 </div>
233 </div>
234
235
236 <div id="bodyColumn" class="span9" >
237
238 <!-- ! Licensed to the Apache Software Foundation (ASF) under one
239 ! or more contributor license agreements. See the NOTICE file
240 ! distributed with this work for additional information
241 ! regarding copyright ownership. The ASF licenses this file
242 ! to you under the Apache License, Version 2.0 (the
243 ! "License"); you may not use this file except in compliance
244 ! with the License. You may obtain a copy of the License at
245 !
246 ! http://www.apache.org/licenses/LICENSE-2.0
247 !
248 ! Unless required by applicable law or agreed to in writing,
249 ! software distributed under the License is distributed on an
250 ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
251 ! KIND, either express or implied. See the License for the
252 ! specific language governing permissions and limitations
253 ! under the License.
254 ! --><h1>Support for Data Ingestion in AsterixDB</h1>
255<div class="section">
256<h2><a name="Table_of_Contents"></a><a name="atoc" id="#toc">Table of Contents</a></h2>
257
258<ul>
259
260<li><a href="#Introduction">Introduction</a></li>
261
262<li><a href="#FeedAdapters">Feed Adapters</a> <!-- * [Feed Policies](#FeedPolicies) --></li>
263</ul></div>
264<div class="section">
265<h2><a name="Introduction">Introduction</a></h2>
<p>This document describes the support for data ingestion in AsterixDB. Data feeds are a mechanism for having continuous data arrive into a Big Data Management System (BDMS) from external sources and incrementally populate a persisted dataset and its associated indexes. The data feed is a BDMS architectural component that makes the Big Data system the caretaker for functionality that used to live outside it, with the aim of improving both users&#x2019; lives and system performance.</p></div>
267<div class="section">
268<h2><a name="Feed_Adapters"></a><a name="FeedAdapters">Feed Adapters</a></h2>
269<p>The functionality of establishing a connection with a data source and receiving, parsing and translating its data into ADM objects (for storage inside AsterixDB) is contained in a feed adapter. A feed adapter is an implementation of an interface and its details are specific to a given data source. An adapter may optionally be given parameters to configure its runtime behavior. Depending upon the data transfer protocol/APIs offered by the data source, a feed adapter may operate in a push or a pull mode. Push mode involves just one initial request by the adapter to the data source for setting up the connection. Once a connection is authorized, the data source &#x201c;pushes&#x201d; data to the adapter without any subsequent requests by the adapter. In contrast, when operating in a pull mode, the adapter makes a separate request each time to receive data. AsterixDB currently provides built-in adapters for several popular data sources such as Twitter and RSS feeds. AsterixDB additionally provides a generic socket-based adapter that can be used to ingest data that is directed at a prescribed socket.</p>
<p>In this tutorial, we describe how to build example data ingestion pipelines that cover the popular scenarios of ingesting data from (a) Twitter, (b) RSS, and (c) a socket feed source.</p>
271<div class="section">
272<div class="section">
273<h4><a name="Ingesting_Twitter_Stream"></a>Ingesting Twitter Stream</h4>
<p>We shall use the built-in push-based Twitter adapter. As a prerequisite, we must define a Tweet using the Asterix Data Model (ADM) and the Asterix Query Language (AQL). Given below are the type definitions in AQL that create a Tweet datatype that is representative of a real tweet as obtained from Twitter.</p>
275
276<div class="source">
277<div class="source">
<pre>    create dataverse feeds;
    use dataverse feeds;

    create type TwitterUser as closed {
        screen_name: string,
        lang: string,
        friends_count: int32,
        statuses_count: int32
    };

    create type Tweet as open {
        id: int64,
        user: TwitterUser
    };

    create dataset Tweets (Tweet)
    primary key id;
</pre></div></div>
<p>The statements above also create a dataset, <tt>Tweets</tt>, that we shall use to persist the tweets in AsterixDB. Next, we use the <tt>create feed</tt> AQL statement to define our example data feed.</p>
297<div class="section">
298<h5><a name="Using_the_push_twitter_feed_adapter"></a>Using the &#x201c;push_twitter&#x201d; feed adapter</h5>
<p>The &#x201c;push_twitter&#x201d; adapter requires setting up an application account with Twitter. To retrieve tweets, Twitter requires registering an application. Registration involves providing a name and a brief description for the application. Each application has associated OAuth authentication credentials that include OAuth keys and tokens. Accessing the Twitter API requires providing the following:</p>

<ol style="list-style-type: decimal">

<li>Consumer Key (API Key)</li>

<li>Consumer Secret (API Secret)</li>

<li>Access Token</li>

<li>Access Token Secret</li>
</ol>
<p>The &#x201c;push_twitter&#x201d; adapter takes the above-mentioned parameters as configuration. End users must obtain these authentication credentials before using the &#x201c;push_twitter&#x201d; adapter. For further information on obtaining OAuth keys and tokens and registering an application with Twitter, please visit <a class="externalLink" href="http://apps.twitter.com">http://apps.twitter.com</a>.</p>
301<p>Given below is an example AQL statement that creates a feed called &#x201c;TwitterFeed&#x201d; by using the &#x201c;push_twitter&#x201d; adapter.</p>
302
303<div class="source">
304<div class="source">
<pre>    use dataverse feeds;

    create feed TwitterFeed if not exists using &quot;push_twitter&quot;
    ((&quot;type-name&quot;=&quot;Tweet&quot;),
     (&quot;format&quot;=&quot;twitter-status&quot;),
     (&quot;consumer.key&quot;=&quot;************&quot;),
     (&quot;consumer.secret&quot;=&quot;**************&quot;),
     (&quot;access.token&quot;=&quot;**********&quot;),
     (&quot;access.token.secret&quot;=&quot;*************&quot;));
</pre></div></div>
<p>The authentication parameters above must be valid. Note that the <tt>create feed</tt> statement does not initiate the flow of data from Twitter into the AsterixDB instance; it only registers the feed with the instance. The flow of data along a feed is initiated when the feed is connected to a target dataset using the <tt>connect feed</tt> statement and activated using the <tt>start feed</tt> statement.</p>
<p>The Twitter adapter also supports several filter parameters of the Twitter streaming API, as follows:</p>
317
318<ol style="list-style-type: decimal">
319
320<li>Track filter (&#x201c;keywords&#x201d;=&#x201c;AsterixDB, Apache&#x201d;)</li>
321
322<li>Locations filter (&#x201c;locations&#x201d;=&#x201c;-29.7, 79.2, 36.7, 72.0; -124.848974,-66.885444, 24.396308, 49.384358&#x201d;)</li>
323
324<li>Language filter (&#x201c;language&#x201d;=&#x201c;en&#x201d;)</li>
325
326<li>Filter level (&#x201c;filter-level&#x201d;=&#x201c;low&#x201d;)</li>
327</ol>
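<p>Because every adapter setting, including the filters listed above, is passed as a ("name"="value") pair, a <tt>create feed</tt> statement can be assembled from a plain mapping. The Python sketch below is only an illustration: the helper name <tt>build_create_feed</tt> is hypothetical, the credential values are placeholders, and the helper does not escape double quotes inside names or values.</p>

```python
def build_create_feed(feed_name, adapter, params):
    """Return an AQL create feed statement as a string.

    Hypothetical helper: it only formats text; it neither validates the
    parameters nor escapes double quotes inside names or values.
    """
    pairs = ['("{0}"="{1}")'.format(key, value) for key, value in params.items()]
    return 'create feed {0} if not exists using "{1}"\n({2});'.format(
        feed_name, adapter, ",\n ".join(pairs))


params = {
    "type-name": "Tweet",
    "format": "twitter-status",
    "consumer.key": "YOUR_CONSUMER_KEY",                # placeholder
    "consumer.secret": "YOUR_CONSUMER_SECRET",          # placeholder
    "access.token": "YOUR_ACCESS_TOKEN",                # placeholder
    "access.token.secret": "YOUR_ACCESS_TOKEN_SECRET",  # placeholder
    "keywords": "news",   # track filter
    "language": "en",     # language filter
}

statement = build_create_feed("TwitterFeed", "push_twitter", params)
print(statement)
```

<p>The printed text can then be submitted to AsterixDB in the same way as a hand-written statement.</p>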
<p>For example, a Twitter feed that tracks tweets containing the keyword &#x201c;news&#x201d; can be defined with the following DDL:</p>
329
330<div class="source">
331<div class="source">
<pre>    use dataverse feeds;

    create feed TwitterFeed if not exists using &quot;push_twitter&quot;
    ((&quot;type-name&quot;=&quot;Tweet&quot;),
     (&quot;format&quot;=&quot;twitter-status&quot;),
     (&quot;consumer.key&quot;=&quot;************&quot;),
     (&quot;consumer.secret&quot;=&quot;**************&quot;),
     (&quot;access.token&quot;=&quot;**********&quot;),
     (&quot;access.token.secret&quot;=&quot;*************&quot;),
     (&quot;keywords&quot;=&quot;news&quot;));
</pre></div></div>
343<p>For more details about these APIs, please visit <a class="externalLink" href="https://dev.twitter.com/streaming/overview/request-parameters">https://dev.twitter.com/streaming/overview/request-parameters</a></p></div></div>
344<div class="section">
345<h4><a name="Lifecycle_of_a_Feed"></a>Lifecycle of a Feed</h4>
<p>A feed is a logical artifact that is brought to life (i.e., its data flow is initiated) only when it is activated using the <tt>start feed</tt> statement. Before we activate a feed, we need to designate the dataset in which its data is to be persisted, using the <tt>connect feed</tt> statement. After a <tt>connect feed</tt> statement, the feed is said to be in the connected state. The <tt>start feed</tt> statement then activates the feed and starts the dataflow from the feed to its connected dataset. Multiple feeds can be connected to a dataset simultaneously, so that the contents of the dataset represent the union of the connected feeds. A single feed can also be connected to multiple target datasets simultaneously.</p>
347
348<div class="source">
349<div class="source">
<pre>    use dataverse feeds;

    connect feed TwitterFeed to dataset Tweets;

    start feed TwitterFeed;
</pre></div></div>
356<p>The <tt>connect feed</tt> statement above directs AsterixDB to persist the data from <tt>TwitterFeed</tt> feed into the <tt>Tweets</tt> dataset. The <tt>start feed</tt> statement will activate the feed and start the dataflow. If it is required (by the high-level application) to also retain the raw tweets obtained from Twitter, the end user may additionally choose to connect TwitterFeed to a different dataset.</p>
<p>Let the feed run for a minute, then run the following query to see up to ten of the tweets that have been stored in the dataset.</p>
358
359<div class="source">
360<div class="source">
<pre>    use dataverse feeds;

    for $i in dataset Tweets limit 10 return $i;
</pre></div></div>
<p>The dataflow from a feed can be terminated explicitly with the <tt>stop feed</tt> statement.</p>
366
367<div class="source">
368<div class="source">
<pre>    use dataverse feeds;

    stop feed TwitterFeed;
</pre></div></div>
<p>The <tt>disconnect feed</tt> statement can be used to disconnect the feed from a given dataset.</p>
374
375<div class="source">
376<div class="source">
<pre>    use dataverse feeds;

    disconnect feed TwitterFeed from dataset Tweets;
</pre></div></div></div></div>
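<p>The lifecycle described above (connect, start, stop, disconnect) can be summarized as a small state model. The Python sketch below is a toy illustration, not AsterixDB code: the class <tt>FeedLifecycle</tt> and its method names are hypothetical, and the only rule it enforces is the one stated in the text, namely that a feed must be connected to a dataset before it is started.</p>

```python
class FeedLifecycle:
    """Toy model of the feed states described in this section."""

    def __init__(self, name):
        self.name = name
        self.datasets = set()   # datasets the feed is connected to
        self.active = False     # True between "start feed" and "stop feed"

    def connect(self, dataset):
        # One feed may be connected to several target datasets.
        self.datasets.add(dataset)

    def start(self):
        if not self.datasets:
            raise RuntimeError("connect the feed to a dataset before starting it")
        self.active = True

    def stop(self):
        self.active = False

    def disconnect(self, dataset):
        self.datasets.discard(dataset)


feed = FeedLifecycle("TwitterFeed")
feed.connect("Tweets")    # connect feed TwitterFeed to dataset Tweets;
feed.start()              # start feed TwitterFeed;
print(feed.active, sorted(feed.datasets))
feed.stop()               # stop feed TwitterFeed;
feed.disconnect("Tweets") # disconnect feed TwitterFeed from dataset Tweets;
```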
381<div class="section">
382<h3><a name="Ingesting_with_Other_Adapters"></a>Ingesting with Other Adapters</h3>
<p>AsterixDB has several built-in feed adapters for data ingestion. Users can also implement their own adapters and plug them into AsterixDB. Here we introduce the <tt>rss_feed</tt>, <tt>socket_adapter</tt>, and <tt>localfs</tt> feed adapters, which cover most common application scenarios.</p>
384<div class="section">
385<div class="section">
386<h5><a name="Using_the_rss_feed_feed_adapter"></a>Using the &#x201c;rss_feed&#x201d; feed adapter</h5>
<p>The <tt>rss_feed</tt> adapter retrieves data from a collection of RSS endpoint URLs. As in the case of ingesting tweets, we must first model an RSS data item using AQL.</p>
388
389<div class="source">
390<div class="source">
<pre>    use dataverse feeds;

    create type Rss if not exists as open {
        id: string,
        title: string,
        description: string,
        link: string
    };

    create dataset RssDataset (Rss)
    primary key id;
</pre></div></div>
403<p>Next, we define an RSS feed using our built-in adapter &#x201c;rss_feed&#x201d;.</p>
404
405<div class="source">
406<div class="source">
<pre>    use dataverse feeds;

    create feed my_feed using
    rss_feed (
        (&quot;type-name&quot;=&quot;Rss&quot;),
        (&quot;format&quot;=&quot;rss&quot;),
        (&quot;url&quot;=&quot;http://rss.cnn.com/rss/edition.rss&quot;)
    );
</pre></div></div>
<p>In the above definition, the configuration parameter &#x201c;url&#x201d; can be a comma-separated list of RSS URLs, where each URL corresponds to an RSS endpoint or an RSS feed. The &#x201c;rss_feed&#x201d; adapter retrieves data from each of the specified URLs in parallel.</p>
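<p>As a rough picture of what a comma-separated &#x201c;url&#x201d; value implies, the Python sketch below splits such a value and hands each URL to a fetch function in parallel. It is a conceptual illustration only and not the adapter's implementation: <tt>split_urls</tt>, <tt>fetch_all</tt>, and <tt>fake_fetch</tt> are hypothetical names, the second URL is a made-up example, and no network access takes place.</p>

```python
from concurrent.futures import ThreadPoolExecutor


def split_urls(url_parameter):
    """Split a comma-separated "url" value into individual URLs."""
    return [part.strip() for part in url_parameter.split(",") if part.strip()]


def fetch_all(url_parameter, fetch):
    """Call fetch(url) for every URL in parallel; return {url: result}."""
    urls = split_urls(url_parameter)
    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as pool:
        # pool.map preserves the order of the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))


def fake_fetch(url):
    # Stand-in for a real RSS request, so the sketch runs offline.
    return "items from " + url


# The second URL is a made-up example endpoint.
url_parameter = "http://rss.cnn.com/rss/edition.rss, http://example.org/feed.rss"
results = fetch_all(url_parameter, fake_fetch)
for url, items in results.items():
    print(url, "=", items)
```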
<p>The following statement connects the feed to the <tt>RssDataset</tt>:</p>
418
419<div class="source">
420<div class="source">
<pre>    use dataverse feeds;

    connect feed my_feed to dataset RssDataset;
</pre></div></div>
<p>The following statement activates the feed and starts the dataflow:</p>
426
427<div class="source">
428<div class="source">
<pre>    use dataverse feeds;

    start feed my_feed;
</pre></div></div>
<p>The following statements show some of the ingested data from the dataset, stop the feed, and disconnect the feed from the dataset.</p>
434
435<div class="source">
436<div class="source">
<pre>    use dataverse feeds;

    for $i in dataset RssDataset limit 10 return $i;

    stop feed my_feed;

    disconnect feed my_feed from dataset RssDataset;
</pre></div></div></div>
445<div class="section">
446<h5><a name="Using_the_socket_adapter_feed_adapter"></a>Using the &#x201c;socket_adapter&#x201d; feed adapter</h5>
<p>The <tt>socket_adapter</tt> feed opens a socket on the given node, which allows users to push data into AsterixDB directly. Here is an example:</p>
448
449<div class="source">
450<div class="source">
<pre>    drop dataverse feeds if exists;
    create dataverse feeds;
    use dataverse feeds;

    create type TestDataType as open {
        screenName: string
    };

    create dataset TestDataset(TestDataType) primary key screenName;

    create feed TestSocketFeed using socket_adapter
    (
        (&quot;sockets&quot;=&quot;127.0.0.1:10001&quot;),
        (&quot;address-type&quot;=&quot;IP&quot;),
        (&quot;type-name&quot;=&quot;TestDataType&quot;),
        (&quot;format&quot;=&quot;adm&quot;)
    );

    connect feed TestSocketFeed to dataset TestDataset;

    use dataverse feeds;
    start feed TestSocketFeed;
</pre></div></div>
<p>The above statements create a socket feed that listens on port 10001 of the host machine. This feed accepts data records in &#x201c;adm&#x201d; format. As an example, you can download the sample dataset <a href="../data/chu.adm">Chirp Users</a> and push its records line by line into the socket feed using any socket client you like. The following is an example socket client in Python:</p>
475
476<div class="source">
477<div class="source">
<pre>    from socket import socket

    ip = '127.0.0.1'
    port1 = 10001
    filePath = 'chu.adm'

    # Connect to the socket opened by the feed.
    sock1 = socket()
    sock1.connect((ip, port1))

    # Open the file in binary mode: sendall() requires bytes in Python 3.
    with open(filePath, 'rb') as inputData:
        for line in inputData:
            sock1.sendall(line)
    sock1.close()
</pre></div></div></div></div>
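<p>Records do not have to come from a file. The Python sketch below pushes records built in code, one per line, and checks the client against a throwaway local listener so that it can be tried without a running feed. Several things here are assumptions: the helper names <tt>push_records</tt> and <tt>start_test_listener</tt> are hypothetical, the two sample records are made up, and the records are serialized as JSON text on the assumption that simple JSON objects are acceptable to a feed declared with the &#x201c;adm&#x201d; format.</p>

```python
import json
import socket
import threading


def push_records(host, port, records):
    """Send each record as one newline-terminated line of JSON text."""
    with socket.create_connection((host, port)) as sock:
        for record in records:
            sock.sendall((json.dumps(record) + "\n").encode("utf-8"))


def start_test_listener():
    """Start a throwaway local listener; return (port, received, thread)."""
    server = socket.socket()
    server.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    received = []

    def serve():
        conn, _ = server.accept()
        chunks = []
        with conn:
            while True:
                data = conn.recv(4096)
                if not data:        # the client closed the connection
                    break
                chunks.append(data)
        server.close()
        received.extend(b"".join(chunks).decode("utf-8").splitlines())

    thread = threading.Thread(target=serve)
    thread.start()
    return port, received, thread


# Made-up sample records shaped like TestDataType.
records = [{"screenName": "alice"}, {"screenName": "bob"}]

port, received, thread = start_test_listener()
push_records("127.0.0.1", port, records)
thread.join()
print(received)
```

<p>To push to the feed defined above instead of the test listener, call <tt>push_records("127.0.0.1", 10001, records)</tt> once the feed has been started.</p>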
492<div class="section">
493<h4><a name="Using_the_localfs_feed_adapter"></a>Using the &#x201c;localfs&#x201d; feed adapter</h4>
<p>The <tt>localfs</tt> adapter enables data ingestion from the local file system. It allows users to feed data records stored on a local disk into a dataset. A DDL example for creating a <tt>localfs</tt> feed is given as follows:</p>
495
496<div class="source">
497<div class="source">
<pre>    use dataverse feeds;

    create type TweetType as closed {
        id: string,
        username: string,
        location: string,
        text: string,
        timestamp: string
    };

    create dataset Tweets(TweetType)
    primary key id;

    create feed TweetFeed
    using localfs
    ((&quot;type-name&quot;=&quot;TweetType&quot;),(&quot;path&quot;=&quot;HOSTNAME://LOCAL_FILE_PATH&quot;),(&quot;format&quot;=&quot;adm&quot;));
</pre></div></div>
<p>As in the previous examples, we need to define the datatype and the dataset that this feed uses. The &#x201c;path&#x201d; parameter refers to the local data file that we want to ingest data from. <tt>HOSTNAME</tt> can be either the IP address or the node name of the machine that holds the file. <tt>LOCAL_FILE_PATH</tt> is the absolute path to the file on that machine. Like <tt>socket_adapter</tt>, this feed takes <tt>adm</tt>-formatted data records.</p></div></div>
516<div class="section">
517<h3><a name="Datatype_for_feed_and_target_dataset"></a>Datatype for feed and target dataset</h3>
<p>The &#x201c;type-name&#x201d; parameter in the <tt>create feed</tt> statement defines the <tt>datatype</tt> of the data source. In most use cases, a feed has the same <tt>datatype</tt> as its target dataset. However, if we want to preprocess the data records before they reach the target dataset (append an autogenerated key, apply user-defined functions, etc.), we need to define the datatypes for the feed and the dataset separately.</p>
519<div class="section">
520<h4><a name="Ingestion_with_autogenerated_key"></a>Ingestion with autogenerated key</h4>
<p>AsterixDB supports using an autogenerated uuid as the primary key of a dataset. To use this feature, we define a datatype that includes the primary key field and declare that field as autogenerated when creating the dataset. Using that same datatype in the feed definition would cause a type discrepancy, since the data source has no such field. Thus, we need to define two separate datatypes for the feed and the dataset:</p>
522
523<div class="source">
524<div class="source">
<pre>    use dataverse feeds;

    create type DBLPFeedType as closed {
        dblpid: string,
        title: string,
        authors: string,
        misc: string
    };

    create type DBLPDataSetType as open {
        id: uuid,
        dblpid: string,
        title: string,
        authors: string,
        misc: string
    };

    create dataset DBLPDataset(DBLPDataSetType) primary key id autogenerated;

    create feed DBLPFeed using socket_adapter
    (
        (&quot;sockets&quot;=&quot;127.0.0.1:10001&quot;),
        (&quot;address-type&quot;=&quot;IP&quot;),
        (&quot;type-name&quot;=&quot;DBLPFeedType&quot;),
        (&quot;format&quot;=&quot;adm&quot;)
    );

    connect feed DBLPFeed to dataset DBLPDataset;

    start feed DBLPFeed;
</pre></div></div></div></div></div>
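<p>The relationship between the two datatypes in the autogenerated-key example can be pictured as follows: a record arriving from the feed has the four fields of <tt>DBLPFeedType</tt>, and the stored record has those fields plus a generated <tt>id</tt>. The Python sketch below is a conceptual model only, since AsterixDB generates the key internally; the function name <tt>add_autogenerated_key</tt> and the sample field values are made up for illustration.</p>

```python
import uuid


def add_autogenerated_key(feed_record, key_field="id"):
    """Return a copy of feed_record extended with a freshly generated UUID.

    Conceptual model only: it mirrors the difference between the feed
    datatype (no key) and the dataset datatype (key plus the same fields).
    """
    stored = {key_field: str(uuid.uuid4())}
    stored.update(feed_record)
    return stored


# Made-up sample record shaped like DBLPFeedType (it has no "id" field).
feed_record = {
    "dblpid": "sample/1",
    "title": "A sample title",
    "authors": "A. Author",
    "misc": "",
}

stored_record = add_autogenerated_key(feed_record)
print(sorted(stored_record))
```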
555<div class="section">
556<h2><a name="Policies_for_Feed_Ingestion"></a><a name="FeedPolicies">Policies for Feed Ingestion</a></h2>
557<p>Multiple feeds may be concurrently operational on an AsterixDB cluster, each competing for resources (CPU cycles, network bandwidth, disk IO) to maintain pace with their respective data sources. As a data management system, AsterixDB is able to manage a set of concurrent feeds and make dynamic decisions related to the allocation of resources, resolving resource bottlenecks and the handling of failures. Each feed has its own set of constraints, influenced largely by the nature of its data source and the applications that intend to consume and process the ingested data. Consider an application that intends to discover the trending topics on Twitter by analyzing tweets that are being processed. Losing a few tweets may be acceptable. In contrast, when ingesting from a data source that provides a click-stream of ad clicks, losing data would translate to a loss of revenue for an application that tracks revenue by charging advertisers per click.</p>
558<p>AsterixDB allows a data feed to have an associated ingestion policy that is expressed as a collection of parameters and associated values. An ingestion policy dictates the runtime behavior of the feed in response to resource bottlenecks and failures. AsterixDB provides a set of policies that help customize the system&#x2019;s runtime behavior when handling excess objects.</p>
559<div class="section">
560<div class="section">
561<h4><a name="Policies"></a>Policies</h4>
562
563<ul>
564
565<li>
566<p><i>Spill</i>: Objects that cannot be processed by an operator for lack of resources (referred to as excess objects hereafter) should be persisted to the local disk for deferred processing.</p></li>
567
568<li>
569<p><i>Discard</i>: Excess objects should be discarded.</p></li>
570</ul>
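<p>The difference between the two policies can be illustrated with a toy operator that has a bounded in-memory queue. The Python sketch below is a simplified model and not AsterixDB's implementation: the class <tt>Operator</tt>, its capacity, and the use of a list to stand in for the local disk are all assumptions made for illustration.</p>

```python
from collections import deque


class Operator:
    """Toy operator with a bounded in-memory queue."""

    def __init__(self, capacity, policy):
        if policy not in ("spill", "discard"):
            raise ValueError("unknown policy: " + policy)
        self.capacity = capacity
        self.policy = policy
        self.queue = deque()
        self.spilled = []     # stands in for objects persisted to local disk
        self.discarded = 0    # number of excess objects dropped

    def offer(self, obj):
        if self.capacity > len(self.queue):
            self.queue.append(obj)          # resources available
        elif self.policy == "spill":
            self.spilled.append(obj)        # defer processing
        else:
            self.discarded += 1             # drop the excess object

    def drain(self):
        """Process queued objects first, then any spilled ones."""
        processed = list(self.queue) + self.spilled
        self.queue.clear()
        self.spilled = []
        return processed


spill = Operator(capacity=2, policy="spill")
discard = Operator(capacity=2, policy="discard")
for n in range(5):
    spill.offer(n)
    discard.offer(n)

spill_result = spill.drain()
discard_result = discard.drain()
print(spill_result)
print(discard_result, discard.discarded)
```

<p>With the spill policy every object is eventually processed at the cost of extra storage, whereas the discard policy keeps resource usage bounded at the cost of losing the excess objects.</p>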
<p>Note that the end user may choose to form a custom policy. For example, it is possible in AsterixDB to create a custom policy that spills excess objects to disk and subsequently resorts to throttling if the spillage crosses a configured threshold. In all cases, the desired ingestion policy is specified as part of the <tt>connect feed</tt> statement; if none is specified, the &#x201c;Basic&#x201d; policy is used by default.</p>
572
573<div class="source">
574<div class="source">
<pre>    use dataverse feeds;

    connect feed TwitterFeed to dataset Tweets
    using policy Basic;
</pre></div></div></div></div></div>
580 </div>
581 </div>
582 </div>
583
584 <hr/>
585
586 <footer>
587 <div class="container-fluid">
588 <div class="row span12">Copyright &copy; 2017
589 <a href="https://www.apache.org/">The Apache Software Foundation</a>.
590 All Rights Reserved.
591
592 </div>
593
595<div class="row-fluid">Apache AsterixDB, AsterixDB, Apache, the Apache
596 feather logo, and the Apache AsterixDB project logo are either
597 registered trademarks or trademarks of The Apache Software
598 Foundation in the United States and other countries.
599 All other marks mentioned may be trademarks or registered
600 trademarks of their respective owners.</div>
601
602
603 </div>
604 </footer>
605 </body>
606</html>