<!DOCTYPE html>
<!--
 | Generated by Apache Maven Doxia at 2017-04-24
 | Rendered using Apache Maven Fluido Skin 1.3.0
-->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta name="Date-Revision-yyyymmdd" content="20170424" />
    <meta http-equiv="Content-Language" content="en" />
    <title>AsterixDB &#x2013; Support for Data Ingestion in AsterixDB</title>
    <link rel="stylesheet" href="../css/apache-maven-fluido-1.3.0.min.css" />
    <link rel="stylesheet" href="../css/site.css" />
    <link rel="stylesheet" href="../css/print.css" media="print" />

    <script type="text/javascript" src="../js/apache-maven-fluido-1.3.0.min.js"></script>

  </head>
  <body class="topBarDisabled">

    <div class="container-fluid">
      <div id="banner">
        <div class="pull-left">
          <a href=".././" id="bannerLeft">
            <img src="../images/asterixlogo.png" alt="AsterixDB"/>
          </a>
        </div>
        <div class="pull-right"> </div>
        <div class="clear"><hr/></div>
      </div>

      <div id="breadcrumbs">
        <ul class="breadcrumb">
          <li id="publishDate">Last Published: 2017-04-24</li>
          <li id="projectVersion" class="pull-right">Version: 0.9.1</li>
          <li class="divider pull-right">|</li>
          <li class="pull-right"> <a href="../index.html" title="Documentation Home">
            Documentation Home</a>
          </li>
        </ul>
      </div>

      <div class="row-fluid">
        <div id="leftColumn" class="span3">
          <div class="well sidebar-nav">

            <ul class="nav nav-list">
              <li class="nav-header">Get Started - Installation</li>
              <li>
                <a href="../ncservice.html" title="Option 1: using NCService">
                  <i class="none"></i>
                  Option 1: using NCService</a>
              </li>
              <li>
                <a href="../ansible.html" title="Option 2: using Ansible">
                  <i class="none"></i>
                  Option 2: using Ansible</a>
              </li>
              <li>
                <a href="../aws.html" title="Option 3: using Amazon Web Services">
                  <i class="none"></i>
                  Option 3: using Amazon Web Services</a>
              </li>
              <li>
                <a href="../yarn.html" title="Option 4: using YARN">
                  <i class="none"></i>
                  Option 4: using YARN</a>
              </li>
              <li>
                <a href="../install.html" title="Option 5: using Managix (deprecated)">
                  <i class="none"></i>
                  Option 5: using Managix (deprecated)</a>
              </li>
              <li class="nav-header">AsterixDB Primer</li>
              <li>
                <a href="../sqlpp/primer-sqlpp.html" title="Option 1: using SQL++">
                  <i class="none"></i>
                  Option 1: using SQL++</a>
              </li>
              <li>
                <a href="../aql/primer.html" title="Option 2: using AQL">
                  <i class="none"></i>
                  Option 2: using AQL</a>
              </li>
              <li class="nav-header">Data Model</li>
              <li>
                <a href="../datamodel.html" title="The Asterix Data Model">
                  <i class="none"></i>
                  The Asterix Data Model</a>
              </li>
              <li class="nav-header">Queries - SQL++</li>
              <li>
                <a href="../sqlpp/manual.html" title="The SQL++ Query Language">
                  <i class="none"></i>
                  The SQL++ Query Language</a>
              </li>
              <li>
                <a href="../sqlpp/builtins.html" title="Builtin Functions">
                  <i class="none"></i>
                  Builtin Functions</a>
              </li>
              <li class="nav-header">Queries - AQL</li>
              <li>
                <a href="../aql/manual.html" title="The Asterix Query Language (AQL)">
                  <i class="none"></i>
                  The Asterix Query Language (AQL)</a>
              </li>
              <li>
                <a href="../aql/builtins.html" title="Builtin Functions">
                  <i class="none"></i>
                  Builtin Functions</a>
              </li>
              <li class="nav-header">API/SDK</li>
              <li>
                <a href="../api.html" title="HTTP API">
                  <i class="none"></i>
                  HTTP API</a>
              </li>
              <li>
                <a href="../csv.html" title="CSV Output">
                  <i class="none"></i>
                  CSV Output</a>
              </li>
              <li class="nav-header">Advanced Features</li>
              <li>
                <a href="../aql/fulltext.html" title="Support of Full-text Queries">
                  <i class="none"></i>
                  Support of Full-text Queries</a>
              </li>
              <li>
                <a href="../aql/externaldata.html" title="Accessing External Data">
                  <i class="none"></i>
                  Accessing External Data</a>
              </li>
              <li class="active">
                <a href="#"><i class="none"></i>Support for Data Ingestion</a>
              </li>
              <li>
                <a href="../udf.html" title="User Defined Functions">
                  <i class="none"></i>
                  User Defined Functions</a>
              </li>
              <li>
                <a href="../aql/filters.html" title="Filter-Based LSM Index Acceleration">
                  <i class="none"></i>
                  Filter-Based LSM Index Acceleration</a>
              </li>
              <li>
                <a href="../aql/similarity.html" title="Support of Similarity Queries">
                  <i class="none"></i>
                  Support of Similarity Queries</a>
              </li>
            </ul>

            <hr class="divider" />

            <div id="poweredBy">
              <div class="clear"></div>
              <div class="clear"></div>
              <div class="clear"></div>
              <a href=".././" title="AsterixDB" class="builtBy">
                <img class="builtBy" alt="AsterixDB" src="../images/asterixlogo.png" />
              </a>
            </div>
          </div>
        </div>

        <div id="bodyColumn" class="span9" >

<!-- ! Licensed to the Apache Software Foundation (ASF) under one
     ! or more contributor license agreements. See the NOTICE file
     ! distributed with this work for additional information
     ! regarding copyright ownership. The ASF licenses this file
     ! to you under the Apache License, Version 2.0 (the
     ! "License"); you may not use this file except in compliance
     ! with the License. You may obtain a copy of the License at
     !
     !   http://www.apache.org/licenses/LICENSE-2.0
     !
     ! Unless required by applicable law or agreed to in writing,
     ! software distributed under the License is distributed on an
     ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     ! KIND, either express or implied. See the License for the
     ! specific language governing permissions and limitations
     ! under the License.
     ! --><h1>Support for Data Ingestion in AsterixDB</h1>
<div class="section">
<h2><a name="Table_of_Contents"></a><a name="atoc" id="toc">Table of Contents</a></h2>

<ul>

<li><a href="#Introduction">Introduction</a></li>

<li><a href="#FeedAdapters">Feed Adapters</a></li>

<li><a href="#FeedPolicies">Policies for Feed Ingestion</a></li>
</ul></div>
<div class="section">
<h2><a name="Introduction">Introduction</a></h2>
<p>This document describes the support for data ingestion in AsterixDB. Data feeds are a mechanism for having continuous data arrive into a BDMS from external sources and incrementally populate a persisted dataset and its associated indexes. The data feed is an architectural component that makes the Big Data system itself the caretaker of functionality that used to live outside it, improving both users&#x2019; lives and system performance.</p></div>
<div class="section">
<h2><a name="Feed_Adapters"></a><a name="FeedAdapters">Feed Adapters</a></h2>
<p>The functionality of establishing a connection with a data source and receiving, parsing and translating its data into ADM objects (for storage inside AsterixDB) is contained in a feed adapter. A feed adapter is an implementation of an interface and its details are specific to a given data source. An adapter may optionally be given parameters to configure its runtime behavior. Depending upon the data transfer protocol/APIs offered by the data source, a feed adapter may operate in a push or a pull mode. Push mode involves just one initial request by the adapter to the data source for setting up the connection. Once a connection is authorized, the data source &#x201c;pushes&#x201d; data to the adapter without any subsequent requests by the adapter. In contrast, when operating in a pull mode, the adapter makes a separate request each time to receive data. AsterixDB currently provides built-in adapters for several popular data sources such as Twitter and RSS feeds. AsterixDB additionally provides a generic socket-based adapter that can be used to ingest data that is directed at a prescribed socket.</p>
<p>In this tutorial, we describe how to build example data ingestion pipelines that cover the popular scenarios of ingesting data from (a) Twitter, (b) RSS, and (c) a socket feed source.</p>
<div class="section">
<div class="section">
<h4><a name="Ingesting_Twitter_Stream"></a>Ingesting Twitter Stream</h4>
<p>We shall use the built-in push-based Twitter adapter. As a prerequisite, we must define a Tweet using the AsterixDB Data Model (ADM) and the AsterixDB Query Language (AQL). Given below are the type definitions in AQL that create a Tweet datatype representative of a real tweet as obtained from Twitter.</p>

<div class="source">
<div class="source">
<pre>    create dataverse feeds;
    use dataverse feeds;

    create type TwitterUser as closed {
        screen_name: string,
        lang: string,
        friends_count: int32,
        statuses_count: int32
    };

    create type Tweet as open {
        id: int64,
        user: TwitterUser
    };

    create dataset Tweets (Tweet)
    primary key id;
</pre></div></div>
<p>The last statement creates the dataset that we shall use to persist the tweets in AsterixDB. Next, we make use of the <tt>create feed</tt> AQL statement to define our example data feed.</p>
<div class="section">
<h5><a name="Using_the_push_twitter_feed_adapter"></a>Using the &#x201c;push_twitter&#x201d; feed adapter</h5>
<p>The &#x201c;push_twitter&#x201d; adapter requires setting up an application account with Twitter. To retrieve tweets, Twitter requires registering an application. Registration involves providing a name and a brief description for the application. Each application has associated OAuth authentication credentials that include OAuth keys and tokens. Accessing the Twitter API requires providing the following:</p>

<ol style="list-style-type: decimal">

<li>Consumer Key (API Key)</li>

<li>Consumer Secret (API Secret)</li>

<li>Access Token</li>

<li>Access Token Secret</li>
</ol>
<p>The &#x201c;push_twitter&#x201d; adapter takes the above parameters as its configuration. End users are required to obtain these authentication credentials prior to using the adapter. For further information on obtaining OAuth keys and tokens and registering an application with Twitter, please visit <a class="externalLink" href="http://apps.twitter.com">http://apps.twitter.com</a></p>
<p>Given below is an example AQL statement that creates a feed called &#x201c;TwitterFeed&#x201d; by using the &#x201c;push_twitter&#x201d; adapter.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    create feed TwitterFeed if not exists using &quot;push_twitter&quot;
    ((&quot;type-name&quot;=&quot;Tweet&quot;),
    (&quot;format&quot;=&quot;twitter-status&quot;),
    (&quot;consumer.key&quot;=&quot;************&quot;),
    (&quot;consumer.secret&quot;=&quot;**************&quot;),
    (&quot;access.token&quot;=&quot;**********&quot;),
    (&quot;access.token.secret&quot;=&quot;*************&quot;));
</pre></div></div>
<p>Valid values must be provided for the above authentication parameters. Note that the <tt>create feed</tt> statement does not initiate the flow of data from Twitter into the AsterixDB instance. Instead, the <tt>create feed</tt> statement only results in registering the feed with the instance. The flow of data along a feed is initiated when the feed is connected to a target dataset using the <tt>connect feed</tt> statement and activated using the <tt>start feed</tt> statement.</p>
<p>The Twitter adapter also supports several Twitter streaming APIs, as follows:</p>

<ol style="list-style-type: decimal">

<li>Track filter (&#x201c;keywords&#x201d;=&#x201c;AsterixDB, Apache&#x201d;)</li>

<li>Locations filter (&#x201c;locations&#x201d;=&#x201c;-29.7, 79.2, 36.7, 72.0; -124.848974,-66.885444, 24.396308, 49.384358&#x201d;)</li>

<li>Language filter (&#x201c;language&#x201d;=&#x201c;en&#x201d;)</li>

<li>Filter level (&#x201c;filter-level&#x201d;=&#x201c;low&#x201d;)</li>
</ol>
<p>For example, a Twitter adapter that tracks tweets containing the keyword &#x201c;news&#x201d; can be defined using the following DDL:</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    create feed TwitterFeed if not exists using &quot;push_twitter&quot;
    ((&quot;type-name&quot;=&quot;Tweet&quot;),
    (&quot;format&quot;=&quot;twitter-status&quot;),
    (&quot;consumer.key&quot;=&quot;************&quot;),
    (&quot;consumer.secret&quot;=&quot;**************&quot;),
    (&quot;access.token&quot;=&quot;**********&quot;),
    (&quot;access.token.secret&quot;=&quot;*************&quot;),
    (&quot;keywords&quot;=&quot;news&quot;));
</pre></div></div>
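<p>The other filter parameters are supplied in the same way. As an illustrative sketch only (assuming, as with the underlying Twitter streaming API, that these filter parameters may be combined in one definition), a hypothetical feed &#x201c;TwitterFeedEn&#x201d; that receives only low filter-level English tweets could be defined as:</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    create feed TwitterFeedEn if not exists using &quot;push_twitter&quot;
    ((&quot;type-name&quot;=&quot;Tweet&quot;),
    (&quot;format&quot;=&quot;twitter-status&quot;),
    (&quot;consumer.key&quot;=&quot;************&quot;),
    (&quot;consumer.secret&quot;=&quot;**************&quot;),
    (&quot;access.token&quot;=&quot;**********&quot;),
    (&quot;access.token.secret&quot;=&quot;*************&quot;),
    (&quot;language&quot;=&quot;en&quot;),
    (&quot;filter-level&quot;=&quot;low&quot;));
</pre></div></div>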
<p>For more details about these APIs, please visit <a class="externalLink" href="https://dev.twitter.com/streaming/overview/request-parameters">https://dev.twitter.com/streaming/overview/request-parameters</a></p></div></div>
<div class="section">
<h4><a name="Lifecycle_of_a_Feed"></a>Lifecycle of a Feed</h4>
<p>A feed is a logical artifact that is brought to life (i.e., its data flow is initiated) only when it is activated using the <tt>start feed</tt> statement. Before we activate a feed, we need to designate the dataset where the data will be persisted, using the <tt>connect feed</tt> statement. Subsequent to a <tt>connect feed</tt> statement, the feed is said to be in the connected state. After that, the <tt>start feed</tt> statement activates the feed and starts the dataflow from the feed to its connected dataset. Multiple feeds can simultaneously be connected to a dataset, such that the contents of the dataset represent the union of the connected feeds. A single feed can also be simultaneously connected to multiple target datasets.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    connect feed TwitterFeed to dataset Tweets;

    start feed TwitterFeed;
</pre></div></div>
<p>The <tt>connect feed</tt> statement above directs AsterixDB to persist the data from the <tt>TwitterFeed</tt> feed into the <tt>Tweets</tt> dataset. The <tt>start feed</tt> statement then activates the feed and starts the dataflow. If it is required (by the high-level application) to also retain the raw tweets obtained from Twitter, the end user may additionally choose to connect TwitterFeed to a different dataset, as sketched below.</p>
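<p>A minimal sketch of this pattern, assuming a second (hypothetical) dataset named <tt>RawTweets</tt> of the same <tt>Tweet</tt> type, is given below:</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    create dataset RawTweets (Tweet)
    primary key id;

    connect feed TwitterFeed to dataset RawTweets;
</pre></div></div>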
<p>Let the feed run for a minute, then run the following query to see the latest tweets that are stored in the dataset.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    for $i in dataset Tweets limit 10 return $i;
</pre></div></div>
<p>The flow of data from a feed can be terminated explicitly by the <tt>stop feed</tt> statement.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    stop feed TwitterFeed;
</pre></div></div>
<p>The <tt>disconnect feed</tt> statement can be used to disconnect the feed from a dataset.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    disconnect feed TwitterFeed from dataset Tweets;
</pre></div></div></div></div>
<div class="section">
<h3><a name="Ingesting_with_Other_Adapters"></a>Ingesting with Other Adapters</h3>
<p>AsterixDB has several built-in feed adapters for data ingestion. Users can also implement their own adapters and plug them into AsterixDB. Here we introduce the <tt>rss_feed</tt>, <tt>socket_adapter</tt>, and <tt>localfs</tt> feed adapters, which cover most common application scenarios.</p>
<div class="section">
<div class="section">
<h5><a name="Using_the_rss_feed_feed_adapter"></a>Using the &#x201c;rss_feed&#x201d; feed adapter</h5>
<p>The <tt>rss_feed</tt> adapter allows retrieving data from a collection of RSS endpoint URLs. As in the case of ingesting tweets, we first model an RSS data item using AQL.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    create type Rss if not exists as open {
        id: string,
        title: string,
        description: string,
        link: string
    };

    create dataset RssDataset (Rss)
    primary key id;
</pre></div></div>
<p>Next, we define an RSS feed using our built-in adapter &#x201c;rss_feed&#x201d;.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    create feed my_feed using
    rss_feed (
        (&quot;type-name&quot;=&quot;Rss&quot;),
        (&quot;format&quot;=&quot;rss&quot;),
        (&quot;url&quot;=&quot;http://rss.cnn.com/rss/edition.rss&quot;)
    );
</pre></div></div>
<p>In the above definition, the configuration parameter &#x201c;url&#x201d; can be a comma-separated list of RSS URLs, where each URL corresponds to an RSS endpoint or an RSS feed. The adapter retrieves data from each of the specified URLs in parallel.</p>
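<p>For instance, a feed over two endpoints might be defined as follows (the feed name <tt>my_multi_feed</tt> and the second URL are hypothetical illustrations, not part of the example above):</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    create feed my_multi_feed using
    rss_feed (
        (&quot;type-name&quot;=&quot;Rss&quot;),
        (&quot;format&quot;=&quot;rss&quot;),
        (&quot;url&quot;=&quot;http://rss.cnn.com/rss/edition.rss,http://rss.cnn.com/rss/cnn_topstories.rss&quot;)
    );
</pre></div></div>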
<p>The following statements connect the feed to the <tt>RssDataset</tt>:</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    connect feed my_feed to dataset RssDataset;
</pre></div></div>
<p>The following statements activate the feed and start the dataflow:</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    start feed my_feed;
</pre></div></div>
<p>The following statements show the latest data from the dataset, stop the feed, and disconnect the feed from the dataset.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    for $i in dataset RssDataset limit 10 return $i;

    stop feed my_feed;

    disconnect feed my_feed from dataset RssDataset;
</pre></div></div></div>
<div class="section">
<h5><a name="Using_the_socket_adapter_feed_adapter"></a>Using the &#x201c;socket_adapter&#x201d; feed adapter</h5>
<p>The <tt>socket_adapter</tt> feed opens a socket on the given node, which allows users to push data into AsterixDB directly. Here is an example:</p>

<div class="source">
<div class="source">
<pre>    drop dataverse feeds if exists;
    create dataverse feeds;
    use dataverse feeds;

    create type TestDataType as open {
        screen-name: string
    };

    create dataset TestDataset(TestDataType) primary key screen-name;

    create feed TestSocketFeed using socket_adapter
    (
        (&quot;sockets&quot;=&quot;127.0.0.1:10001&quot;),
        (&quot;address-type&quot;=&quot;IP&quot;),
        (&quot;type-name&quot;=&quot;TestDataType&quot;),
        (&quot;format&quot;=&quot;adm&quot;)
    );

    connect feed TestSocketFeed to dataset TestDataset;

    start feed TestSocketFeed;
</pre></div></div>
<p>The above statements create a socket feed that listens on port 10001 of the host machine. This feed accepts data records in &#x201c;adm&#x201d; format. As an example, you can download the sample dataset <a href="../data/chu.adm">Chirp Users</a> and push it line by line into the socket feed using any socket client you like. Following is a socket client example in Python:</p>

<div class="source">
<div class="source">
<pre>    from socket import socket

    ip = '127.0.0.1'
    port = 10001
    filePath = 'chu.adm'

    # Connect to the socket feed.
    sock = socket()
    sock.connect((ip, port))

    # Send the ADM records line by line (encoded as bytes for Python 3).
    with open(filePath) as inputData:
        for line in inputData:
            sock.sendall(line.encode('utf-8'))

    sock.close()
</pre></div></div>
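<p>When you are done pushing records, the feed can be stopped and disconnected like any other feed:</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    stop feed TestSocketFeed;

    disconnect feed TestSocketFeed from dataset TestDataset;
</pre></div></div></div></div>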
<div class="section">
<h4><a name="Using_the_localfs_feed_adapter"></a>Using the &#x201c;localfs&#x201d; feed adapter</h4>
<p>The <tt>localfs</tt> adapter enables data ingestion from the local file system. It allows users to feed data records on a local disk into a dataset. A DDL example for creating a <tt>localfs</tt> feed is given as follows:</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    create type TweetType as closed {
        id: string,
        username : string,
        location : string,
        text : string,
        timestamp : string
    };

    create dataset Tweets(TweetType)
    primary key id;

    create feed TweetFeed
    using localfs
    ((&quot;type-name&quot;=&quot;TweetType&quot;),(&quot;path&quot;=&quot;HOSTNAME://LOCAL_FILE_PATH&quot;),(&quot;format&quot;=&quot;adm&quot;));
</pre></div></div>
<p>Similar to the previous examples, we need to define the datatype and dataset this feed uses. The &#x201c;path&#x201d; parameter refers to the local datafile that we want to ingest data from. <tt>HOSTNAME</tt> can either be the IP address or the node name of the machine that holds the file. <tt>LOCAL_FILE_PATH</tt> indicates the absolute path to the file on that machine. Like <tt>socket_adapter</tt>, this feed takes <tt>adm</tt>-formatted data records.</p>
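<p>As with the other adapters, ingestion only begins once the feed is connected to its target dataset and started:</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    connect feed TweetFeed to dataset Tweets;

    start feed TweetFeed;
</pre></div></div></div></div>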
<div class="section">
<h3><a name="Datatype_for_feed_and_target_dataset"></a>Datatype for feed and target dataset</h3>
<p>The &#x201c;type-name&#x201d; parameter in the <tt>create feed</tt> statement defines the <tt>datatype</tt> of the data source. In most use cases, a feed will have the same <tt>datatype</tt> as the target dataset. However, if we want to perform some preprocessing before the data records get into the target dataset (appending an autogenerated key, applying user-defined functions, etc.), we need to define the datatypes for the feed and the dataset separately.</p>
<div class="section">
<h4><a name="Ingestion_with_autogenerated_key"></a>Ingestion with autogenerated key</h4>
<p>AsterixDB supports using an autogenerated UUID as the primary key for a dataset. When we use this feature, we need to define a datatype containing the primary key field and specify that field to be autogenerated when creating the dataset. Using that same datatype in the feed definition would cause a type discrepancy, since no such field exists in the data source. Thus, we need to define two separate datatypes for the feed and the dataset:</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    create type DBLPFeedType as closed {
        dblpid: string,
        title: string,
        authors: string,
        misc: string
    };

    create type DBLPDataSetType as open {
        id: uuid,
        dblpid: string,
        title: string,
        authors: string,
        misc: string
    };

    create dataset DBLPDataset(DBLPDataSetType) primary key id autogenerated;

    create feed DBLPFeed using socket_adapter
    (
        (&quot;sockets&quot;=&quot;127.0.0.1:10001&quot;),
        (&quot;address-type&quot;=&quot;IP&quot;),
        (&quot;type-name&quot;=&quot;DBLPFeedType&quot;),
        (&quot;format&quot;=&quot;adm&quot;)
    );

    connect feed DBLPFeed to dataset DBLPDataset;

    start feed DBLPFeed;
</pre></div></div></div></div></div>
<div class="section">
<h2><a name="Policies_for_Feed_Ingestion"></a><a name="FeedPolicies">Policies for Feed Ingestion</a></h2>
<p>Multiple feeds may be concurrently operational on an AsterixDB cluster, each competing for resources (CPU cycles, network bandwidth, disk I/O) to maintain pace with their respective data sources. As a data management system, AsterixDB is able to manage a set of concurrent feeds and make dynamic decisions related to the allocation of resources, the resolution of resource bottlenecks, and the handling of failures. Each feed has its own set of constraints, influenced largely by the nature of its data source and the applications that intend to consume and process the ingested data. Consider an application that intends to discover trending topics on Twitter by analyzing the tweets being processed; losing a few tweets may be acceptable. In contrast, when ingesting from a data source that provides a click-stream of ad clicks, losing data would translate to a loss of revenue for an application that tracks revenue by charging advertisers per click.</p>
<p>AsterixDB allows a data feed to have an associated ingestion policy that is expressed as a collection of parameters and associated values. An ingestion policy dictates the runtime behavior of the feed in response to resource bottlenecks and failures. AsterixDB provides a set of policies that help customize the system&#x2019;s runtime behavior when handling excess objects.</p>
<div class="section">
<div class="section">
<h4><a name="Policies"></a>Policies</h4>

<ul>

<li>
<p><i>Spill</i>: Objects that cannot be processed by an operator for lack of resources (referred to as excess objects hereafter) should be persisted to the local disk for deferred processing.</p></li>

<li>
<p><i>Discard</i>: Excess objects should be discarded.</p></li>
</ul>
<p>Note that the end user may choose to form a custom policy. For example, it is possible in AsterixDB to create a custom policy that spills excess objects to disk and subsequently resorts to throttling if the spillage crosses a configured threshold. In all cases, the desired ingestion policy is specified as part of the <tt>connect feed</tt> statement; otherwise the &#x201c;Basic&#x201d; policy is used as the default.</p>

<div class="source">
<div class="source">
<pre>    use dataverse feeds;

    connect feed TwitterFeed to dataset Tweets
    using policy Basic;
</pre></div></div></div></div></div>
        </div>
      </div>
    </div>

    <hr/>

    <footer>
      <div class="container-fluid">
        <div class="row span12">Copyright &copy; 2017
          <a href="https://www.apache.org/">The Apache Software Foundation</a>.
          All Rights Reserved.
        </div>

        <div class="row-fluid">Apache AsterixDB, AsterixDB, Apache, the Apache
          feather logo, and the Apache AsterixDB project logo are either
          registered trademarks or trademarks of The Apache Software
          Foundation in the United States and other countries.
          All other marks mentioned may be trademarks or registered
          trademarks of their respective owners.</div>

      </div>
    </footer>
  </body>
</html>