<!DOCTYPE html>
<!--
 | Generated by Apache Maven Doxia at 2015-05-29
 | Rendered using Apache Maven Fluido Skin 1.3.0
-->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta name="Date-Revision-yyyymmdd" content="20150529" />
    <meta http-equiv="Content-Language" content="en" />
    <title>AsterixDB - CSV Support in AsterixDB</title>
    <link rel="stylesheet" href="./css/apache-maven-fluido-1.3.0.min.css" />
    <link rel="stylesheet" href="./css/site.css" />
    <link rel="stylesheet" href="./css/print.css" media="print" />

    <script type="text/javascript" src="./js/apache-maven-fluido-1.3.0.min.js"></script>

  </head>
  <body class="topBarDisabled">

    <div class="container-fluid">
      <div id="banner">
        <div class="pull-left">
          <a href="./" id="bannerLeft">
            <img src="images/asterixlogo.png" alt="AsterixDB"/>
          </a>
        </div>
        <div class="pull-right"> <a href="http://incubator.apache.org/" id="bannerRight">
            <img src="images/egg-logo.png" alt="Apache Software Foundation Incubator"/>
          </a>
        </div>
        <div class="clear"><hr/></div>
      </div>

      <div id="breadcrumbs">
        <ul class="breadcrumb">
          <li id="publishDate">Last Published: 2015-05-29</li>
          <li id="projectVersion" class="pull-right">Version: 0.8.7-SNAPSHOT</li>
          <li class="divider pull-right">|</li>
          <li class="pull-right"> <a href="index.html" title="Home">
            Home</a>
          </li>
        </ul>
      </div>

      <div class="row-fluid">
        <div id="leftColumn" class="span3">
          <div class="well sidebar-nav">

            <ul class="nav nav-list">
              <li class="nav-header">Apache Software Foundation</li>
              <li>
                <a href="http://www.apache.org/" class="externalLink" title="Home">
                <i class="none"></i>
                Home</a>
              </li>
              <li>
                <a href="http://www.apache.org/foundation/sponsorship.html" class="externalLink" title="Donate">
                <i class="none"></i>
                Donate</a>
              </li>
              <li>
                <a href="http://www.apache.org/foundation/thanks.html" class="externalLink" title="Thanks">
                <i class="none"></i>
                Thanks</a>
              </li>
              <li>
                <a href="http://www.apache.org/security/" class="externalLink" title="Security">
                <i class="none"></i>
                Security</a>
              </li>
              <li class="nav-header">User Documentation</li>
              <li>
                <a href="install.html" title="Installing and Managing AsterixDB using Managix">
                <i class="none"></i>
                Installing and Managing AsterixDB using Managix</a>
              </li>
              <li>
                <a href="aql/primer.html" title="AsterixDB 101: An ADM and AQL Primer">
                <i class="none"></i>
                AsterixDB 101: An ADM and AQL Primer</a>
              </li>
              <li>
                <a href="aql/primer-sql-like.html" title="AsterixDB 101: An ADM and AQL Primer (For SQL Fans)">
                <i class="none"></i>
                AsterixDB 101: An ADM and AQL Primer (For SQL Fans)</a>
              </li>
              <li>
                <a href="aql/js-sdk.html" title="AsterixDB Javascript SDK">
                <i class="none"></i>
                AsterixDB Javascript SDK</a>
              </li>
              <li>
                <a href="aql/datamodel.html" title="Asterix Data Model (ADM)">
                <i class="none"></i>
                Asterix Data Model (ADM)</a>
              </li>
              <li>
                <a href="aql/manual.html" title="Asterix Query Language (AQL)">
                <i class="none"></i>
                Asterix Query Language (AQL)</a>
              </li>
              <li>
                <a href="aql/functions.html" title="AQL Functions">
                <i class="none"></i>
                AQL Functions</a>
              </li>
              <li>
                <a href="aql/allens.html" title="AQL Allen's Relations Functions">
                <i class="none"></i>
                AQL Allen's Relations Functions</a>
              </li>
              <li>
                <a href="aql/similarity.html" title="AQL Support of Similarity Queries">
                <i class="none"></i>
                AQL Support of Similarity Queries</a>
              </li>
              <li>
                <a href="aql/externaldata.html" title="Accessing External Data">
                <i class="none"></i>
                Accessing External Data</a>
              </li>
              <li>
                <a href="aql/filters.html" title="Filter-Based LSM Index Acceleration">
                <i class="none"></i>
                Filter-Based LSM Index Acceleration</a>
              </li>
              <li>
                <a href="api.html" title="REST API to AsterixDB">
                <i class="none"></i>
                REST API to AsterixDB</a>
              </li>
            </ul>

            <hr class="divider" />

            <div id="poweredBy">
              <div class="clear"></div>
              <div class="clear"></div>
              <div class="clear"></div>
              <a href="./" title="Hyracks" class="builtBy">
                <img class="builtBy" alt="Hyracks" src="images/hyrax_ts.png" />
              </a>
            </div>
          </div>
        </div>

        <div id="bodyColumn" class="span9" >

<h1>CSV Support in AsterixDB</h1>
<div class="section">
<h2>Introduction - Defining a datatype for CSV<a name="Introduction_-_Defining_a_datatype_for_CSV"></a></h2>
<p>AsterixDB supports the CSV format for both data input and query result output. In both cases, the structure of the CSV data must be defined using a named ADM record datatype. The CSV format, its limitations, and its MIME type are defined by <a class="externalLink" href="https://tools.ietf.org/html/rfc4180">RFC 4180</a>.</p>
<p>CSV is not as expressive as the full Asterix Data Model: not all data that can be represented in ADM can also be represented as CSV, so the form of this datatype is restricted. First, it may not contain any nested records or lists, as CSV has no way to represent nested data structures; all fields in the record type must be primitive. Second, the set of supported primitive types is limited to numerics (<tt>int8</tt>, <tt>int16</tt>, <tt>int32</tt>, <tt>int64</tt>, <tt>float</tt>, <tt>double</tt>) and <tt>string</tt>. On output, a few additional primitive types (<tt>boolean</tt>, datetime types) are supported and are represented as strings.</p>
<p>For the purposes of this document, we will use the following dataverse and datatype definitions:</p>

<div class="source">
<pre>drop dataverse csv if exists;
create dataverse csv;
use dataverse csv;

create type &quot;csv_type&quot; as closed {
    &quot;id&quot;: int32,
    &quot;money&quot;: float,
    &quot;name&quot;: string
};

create dataset &quot;csv_set&quot; (&quot;csv_type&quot;) primary key &quot;id&quot;;
</pre></div>
<p>Note: There is no explicit restriction against using an open datatype for CSV purposes, and you may have optional fields in the datatype (e.g., <tt>id: int32?</tt>). However, the CSV format itself is rigid, so using either of these datatype features introduces possible failure modes on output, which are discussed below.</p></div>
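<p>The input type restriction described above can be sketched as a small check. This is only an illustration: the class <tt>CsvTypeCheck</tt> and its method are hypothetical names, not part of AsterixDB, and the check covers only the input-side list of types (the extra output-only types such as <tt>boolean</tt> are deliberately left out).</p>

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of AsterixDB: checks a field type name against
// the primitive types this page lists as supported for CSV input.
final class CsvTypeCheck {
    private static final Set<String> LOADABLE = new HashSet<>(Arrays.asList(
        "int8", "int16", "int32", "int64", "float", "double", "string"));

    // Accepts a type name as written in a type definition, e.g. "int32" or "int32?".
    static boolean isLoadable(String fieldType) {
        String name = fieldType.trim();
        if (name.endsWith("?")) {
            // A trailing '?' marks an optional field; the underlying type still applies.
            name = name.substring(0, name.length() - 1).trim();
        }
        return LOADABLE.contains(name);
    }

    public static void main(String[] args) {
        String[] samples = {"int32", "float", "string", "int32?", "boolean", "point"};
        for (String type : samples) {
            System.out.println(type + " -> " + isLoadable(type));
        }
    }
}
```

<p>A check like this could be run over the fields of a record type before attempting a load, so that an unsupported field is reported before any data is read.</p>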
<div class="section">
<h2>CSV Input<a name="CSV_Input"></a></h2>
<p>CSV data may be loaded into a dataset using the normal &#x201c;load dataset&#x201d; mechanisms, utilizing the built-in &#x201c;delimited-text&#x201d; format. See <a href="aql/externaldata.html">Accessing External Data</a> for more details. Note that comma is the default value for the &#x201c;delimiter&#x201d; parameter, so it does not need to be explicitly specified.</p>
<p>In this case, the datatype used to interpret the CSV data is the datatype associated with the dataset being loaded. For example, to load a file stored locally on the node controller (NC) into our example dataset:</p>

<div class="source">
<pre>use dataverse csv;

load dataset &quot;csv_set&quot; using localfs
((&quot;path&quot;=&quot;127.0.0.1:///tmp/my_sample.csv&quot;),
 (&quot;format&quot;=&quot;delimited-text&quot;));
</pre></div>
<p>So, if the file <tt>/tmp/my_sample.csv</tt> contained</p>

<div class="source">
<pre>1,18.50,&quot;Peter Krabnitz&quot;
2,74.50,&quot;Jesse Stevens&quot;
</pre></div>
<p>then the preceding query would load it into the dataset <tt>csv_set</tt>.</p>
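<p>If the input file is produced by your own program, lines in this shape could be generated with a sketch like the following. The class <tt>CsvLineWriter</tt> is a hypothetical helper, not part of AsterixDB. It quotes the string field and doubles any embedded double quotes, as RFC 4180 describes, and it relies on Java's default <tt>float</tt> formatting, which switches to exponent notation for very large or very small values; how AsterixDB treats such values is not covered here.</p>

```java
// Hypothetical helper, not part of AsterixDB: formats records of the example
// csv_type (id: int32, money: float, name: string) as CSV lines.
final class CsvLineWriter {
    // Wraps a string field in double quotes and doubles any embedded
    // double quotes, following RFC 4180, Section 2.
    static String quote(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Builds one CSV line with the fields in the order declared in csv_type.
    static String toCsvLine(int id, float money, String name) {
        return id + "," + money + "," + quote(name);
    }

    public static void main(String[] args) {
        System.out.println(toCsvLine(1, 18.5f, "Peter Krabnitz"));
        System.out.println(toCsvLine(2, 74.5f, "Jesse Stevens"));
    }
}
```

<p>Keeping the field order identical to the order in the type definition matters, because the loader interprets each column by position.</p>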
<p>If your CSV file has a header (that is, the first line contains a set of field names rather than actual data), you can instruct Asterix to ignore this header by adding the parameter <tt>&quot;header&quot;=&quot;true&quot;</tt>, e.g.:</p>

<div class="source">
<pre>load dataset &quot;csv_set&quot; using localfs
((&quot;path&quot;=&quot;127.0.0.1:///tmp/my_header_sample.csv&quot;),
 (&quot;format&quot;=&quot;delimited-text&quot;),
 (&quot;header&quot;=&quot;true&quot;));
</pre></div>
<p>CSV data may also be loaded from HDFS; see <a href="aql/externaldata.html">Accessing External Data</a> for details. However, please note that CSV files on HDFS cannot have headers. Specifying &#x201c;header&#x201d;=&#x201c;true&#x201d; when reading from HDFS could cause non-header lines of data to be skipped as well.</p></div>
<div class="section">
<h2>CSV Output<a name="CSV_Output"></a></h2>
<p>Any query may be rendered as CSV when using AsterixDB&#x2019;s HTTP interface. Two steps are required: specify the record type that defines the schema of your CSV, and request that Asterix use the CSV output format.</p>
<div class="section">
<div class="section">
<h4>Output Record Type<a name="Output_Record_Type"></a></h4>
<p>Background: The result of any AQL query is an unordered list of <i>instances</i>, where each <i>instance</i> is an instance of an AQL datatype. When requesting CSV output, there are some restrictions on the legal datatypes in this unordered list due to the limited expressiveness of CSV:</p>

<ol style="list-style-type: decimal">

<li>Each instance must be of a record type.</li>

<li>Each instance must be of the <i>same</i> record type.</li>

<li>The record type must conform to the content and type restrictions mentioned in the introduction.</li>
</ol>
<p>While it would be possible to structure your query to cast all result instances to a given type, it is not necessary. AQL offers a built-in feature which will automatically cast all top-level instances in the result to a specified named ADM record type. To enable this feature, use a <tt>set</tt> statement prior to the query to set the parameter <tt>output-record-type</tt> to the name of an ADM type. This type must have already been defined in the current dataverse.</p>
<p>For example, the following request will ensure that all result instances are cast to the <tt>csv_type</tt> type declared earlier:</p>

<div class="source">
<pre>use dataverse csv;
set output-record-type &quot;csv_type&quot;;

for $n in dataset &quot;csv_set&quot; return $n;
</pre></div>
<p>In this case the casting is redundant, since by definition every value in <tt>csv_set</tt> is already of type <tt>csv_type</tt>. The feature becomes useful for more complex queries, for example where the result values are created by joining fields from different underlying datasets.</p>
<p>Two notes about <tt>output-record-type</tt>:</p>

<ol style="list-style-type: decimal">

<li>This feature is not strictly related to CSV; it may be used with any output format (in which case any record datatype may be specified, not subject to the limitations specified in the introduction of this page).</li>

<li>When the CSV output format is requested, <tt>output-record-type</tt> is in fact required, not optional. This is because the type is used to determine the field names for the CSV header and to ensure that the ordering of fields in the output is consistent, without which the CSV could not be interpreted reliably.</li>
</ol></div>
<div class="section">
<h4>Request the CSV Output Format<a name="Request_the_CSV_Output_Format"></a></h4>
<p>When sending requests to the Asterix HTTP API, Asterix decides what format to use for rendering the results in one of two ways:</p>

<ul>

<li>
<p>An HTTP query parameter named &#x201c;output&#x201d;, which must be set to one of the following values: <tt>JSON</tt>, <tt>CSV</tt>, or <tt>ADM</tt>.</p></li>

<li>
<p>The <a class="externalLink" href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.1"><tt>Accept</tt> HTTP header</a>.</p></li>
</ul>
<p>By default, Asterix will produce JSON output. To select CSV output, pass the parameter <tt>output=CSV</tt>, or set the <tt>Accept</tt> header on your request to the MIME type <tt>text/csv</tt>. The details of how to accomplish this depend on the tools you are using to contact the HTTP API. Here is an example from a Unix shell prompt using the command-line utility &#x201c;curl&#x201d; and specifying the &#x201c;output&#x201d; query parameter:</p>

<div class="source">
<pre>curl -G &quot;http://localhost:19002/query&quot; \
    --data-urlencode 'output=CSV' \
    --data-urlencode 'query=use dataverse csv;
        set output-record-type &quot;csv_type&quot;;
        for $n in dataset csv_set return $n;'
</pre></div>
<p>Alternatively, the same query using the <tt>Accept</tt> header:</p>

<div class="source">
<pre>curl -G -H &quot;Accept: text/csv&quot; &quot;http://localhost:19002/query&quot; \
    --data-urlencode 'query=use dataverse csv;
        set output-record-type &quot;csv_type&quot;;
        for $n in dataset csv_set return $n;'
</pre></div>
<p>Similarly, a trivial Java program that executes the above sample query and selects CSV output via the <tt>Accept</tt> header would be:</p>

<div class="source">
<pre>import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class AsterixExample {
    public static void main(String[] args) throws Exception {
        // The same AQL statements as in the curl examples above.
        String query = &quot;use dataverse csv; &quot; +
            &quot;set output-record-type \&quot;csv_type\&quot;; &quot; +
            &quot;for $n in dataset csv_set return $n&quot;;
        // Pass the URL-encoded statements as the &quot;query&quot; parameter.
        URL asterix = new URL(&quot;http://localhost:19002/query?query=&quot; +
            URLEncoder.encode(query, &quot;UTF-8&quot;));
        HttpURLConnection conn = (HttpURLConnection) asterix.openConnection();
        // Request CSV output via the Accept header.
        conn.setRequestProperty(&quot;Accept&quot;, &quot;text/csv&quot;);
        // Print the response line by line; the reader is closed automatically.
        try (BufferedReader result = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = result.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
</pre></div>
<p>For any of the above examples, the output would be:</p>

<div class="source">
<pre>1,18.5,&quot;Peter Krabnitz&quot;
2,74.5,&quot;Jesse Stevens&quot;
</pre></div>
<p>assuming you have already run the previous examples to create the dataverse and populate the dataset.</p></div>
<div class="section">
<h4>Outputting CSV with a Header<a name="Outputting_CSV_with_a_Header"></a></h4>
<p>By default, AsterixDB will produce CSV results with no header line. If you want a header, you may explicitly request it in one of two ways:</p>

<ul>

<li>
<p>By passing the HTTP query parameter &#x201c;header&#x201d; with the value &#x201c;present&#x201d;.</p></li>

<li>
<p>By specifying the MIME type <tt>text/csv; header=present</tt> in your HTTP <tt>Accept</tt> header. This is consistent with RFC 4180.</p></li>
</ul></div>
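<p>The query-parameter form of both requests (CSV output, optionally with a header line) can be sketched as a small URL builder. The class <tt>CsvRequestUrl</tt> and its method are hypothetical names used only for illustration; the parameter names <tt>output</tt>, <tt>header</tt>, and <tt>query</tt> and the endpoint are the ones used elsewhere on this page.</p>

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of AsterixDB: builds a request URL that asks
// for CSV output through query parameters rather than the Accept header.
final class CsvRequestUrl {
    // endpoint is the query URL, e.g. "http://localhost:19002/query".
    static String build(String endpoint, String query, boolean withHeader) {
        StringBuilder url = new StringBuilder(endpoint).append("?output=CSV");
        if (withHeader) {
            // Ask for a header line containing the field names.
            url.append("&header=present");
        }
        url.append("&query=").append(encode(query));
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String query = "use dataverse csv; "
            + "set output-record-type \"csv_type\"; "
            + "for $n in dataset csv_set return $n";
        System.out.println(build("http://localhost:19002/query", query, true));
    }
}
```

<p>The resulting URL could be opened with <tt>HttpURLConnection</tt> in the same way as in the earlier Java example, without setting the <tt>Accept</tt> header.</p>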
<div class="section">
<h4>Issues with open datatypes and optional fields<a name="Issues_with_open_datatypes_and_optional_fields"></a></h4>
<p>As mentioned earlier, CSV is a rigid format. It cannot express records with different numbers of fields, which ADM allows through both open datatypes and optional fields.</p>
<p>If your output record type contains optional fields, this will not result in any errors. If the output data of a query does not contain a value for an optional field, the missing value will be represented in CSV as <tt>null</tt>.</p>
<p>If your output record type is open, this will also not result in any errors. If the output data of a query contains any open fields, the corresponding rows in the resulting CSV will contain more comma-separated values than the others. On each such row, the data from the closed fields in the type will be output first in the normal order, followed by the data from the open fields in an arbitrary order.</p>
<p>According to RFC 4180 this is not strictly valid CSV (Section 2, rule 4, &#x201c;Each line <i>should</i> contain the same number of fields throughout the file&#x201d;). Hence it will likely not be handled consistently by all CSV processors, and some may throw a parsing error. If you attempt to load this data into AsterixDB later using <tt>load dataset</tt>, the extra fields will be silently ignored. For this reason it is recommended that you use only closed datatypes as output record types. AsterixDB allows an open record type to be used only to support cases where the type already exists for other parts of your application.</p></div></div></div>
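<p>If you do receive CSV produced from an open record type, rows with extra fields can be detected before handing the data to a stricter CSV processor. The following is a minimal sketch with a hypothetical class name, <tt>CsvRowCheck</tt>. It counts fields per line while respecting quoted commas, and it assumes that no field contains an embedded line break.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of AsterixDB: finds CSV rows whose field
// count differs from the number of fields in the closed part of the type.
final class CsvRowCheck {
    // Counts the fields on one line. Commas inside double-quoted fields are
    // not separators; a doubled quote toggles the state twice, so it leaves
    // the "inside quotes" state unchanged.
    static int countFields(String line) {
        int fields = 1;
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                fields++;
            }
        }
        return fields;
    }

    // Returns the 1-based numbers of all rows that do not have exactly
    // expectedFields fields.
    static List<Integer> raggedRows(List<String> lines, int expectedFields) {
        List<Integer> ragged = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (countFields(lines.get(i)) != expectedFields) {
                ragged.add(i + 1);
            }
        }
        return ragged;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "1,18.5,\"Peter Krabnitz\"",
            "2,74.5,\"Stevens, Jesse\"",
            "3,9.0,\"Open Row\",\"extra\"");
        System.out.println("Rows with unexpected field counts: " + raggedRows(lines, 3));
    }
}
```

<p>For the example <tt>csv_type</tt>, the expected field count is 3; any row reported by such a check carries open fields that other tools may reject or silently drop.</p>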
        </div>
      </div>
    </div>

    <hr/>

    <footer>
      <div class="container-fluid">
        <div class="row span12">Copyright &copy; 2015.
          All Rights Reserved.

        </div>

        <div class="row-fluid">Apache AsterixDB, AsterixDB, Apache, the Apache
          feather logo, and the Apache AsterixDB project logo are either
          registered trademarks or trademarks of The Apache Software
          Foundation in the United States and other countries.
          All other marks mentioned may be trademarks or registered
          trademarks of their respective owners.</div>

      </div>
    </footer>
  </body>
</html>