Blame - content/docs/0.9.8/csv.html - incubator-asterixdb-site

Ian Maxon	858061a	2022-05-12 19:11:28 -0700	[diff] [blame]	1	<!DOCTYPE html>
				2	<!--
				3	\| Generated by Apache Maven Doxia Site Renderer 1.8.1 from src/site/markdown/csv.md at 2022-05-12
				4	\| Rendered using Apache Maven Fluido Skin 1.7
				5	-->
				6	<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
				7	<head>
				8	<meta charset="UTF-8" />
				9	<meta name="viewport" content="width=device-width, initial-scale=1.0" />
				10	<meta name="Date-Revision-yyyymmdd" content="20220512" />
				11	<meta http-equiv="Content-Language" content="en" />
				12	<title>AsterixDB – CSV Support in AsterixDB</title>
				13	<link rel="stylesheet" href="./css/apache-maven-fluido-1.7.min.css" />
				14	<link rel="stylesheet" href="./css/site.css" />
				15	<link rel="stylesheet" href="./css/print.css" media="print" />
				16	<script type="text/javascript" src="./js/apache-maven-fluido-1.7.min.js"></script>
				17
				18	</head>
				19	<body class="topBarDisabled">
				20	<div class="container-fluid">
				21	<div id="banner">
				22	<div class="pull-left"><a href="./" id="bannerLeft"><img src="images/asterixlogo.png" alt="AsterixDB"/></a></div>
				23	<div class="pull-right"></div>
				24	<div class="clear"><hr/></div>
				25	</div>
				26
				27	<div id="breadcrumbs">
				28	<ul class="breadcrumb">
				29	<li id="publishDate">Last Published: 2022-05-12</li>
				30	<li id="projectVersion" class="pull-right">Version: 0.9.8</li>
				31	<li class="pull-right"><a href="index.html" title="Documentation Home">Documentation Home</a></li>
				32	</ul>
				33	</div>
				34	<div class="row-fluid">
				35	<div id="leftColumn" class="span2">
				36	<div class="well sidebar-nav">
				37	<ul class="nav nav-list">
				38	<li class="nav-header">Get Started - Installation</li>
				39	<li><a href="ncservice.html" title="Option 1: using NCService"><span class="none"></span>Option 1: using NCService</a></li>
				40	<li><a href="ansible.html" title="Option 2: using Ansible"><span class="none"></span>Option 2: using Ansible</a></li>
				41	<li><a href="aws.html" title="Option 3: using Amazon Web Services"><span class="none"></span>Option 3: using Amazon Web Services</a></li>
				42	<li class="nav-header">AsterixDB Primer</li>
				43	<li><a href="sqlpp/primer-sqlpp.html" title="Using SQL++"><span class="none"></span>Using SQL++</a></li>
				44	<li class="nav-header">Data Model</li>
				45	<li><a href="datamodel.html" title="The Asterix Data Model"><span class="none"></span>The Asterix Data Model</a></li>
				46	<li class="nav-header">Queries</li>
				47	<li><a href="sqlpp/manual.html" title="The SQL++ Query Language"><span class="none"></span>The SQL++ Query Language</a></li>
				48	<li><a href="SQLPP.html" title="Raw SQL++ Grammar"><span class="none"></span>Raw SQL++ Grammar</a></li>
				49	<li><a href="sqlpp/builtins.html" title="Builtin Functions"><span class="none"></span>Builtin Functions</a></li>
				50	<li class="nav-header">API/SDK</li>
				51	<li><a href="api.html" title="HTTP API"><span class="none"></span>HTTP API</a></li>
				52	<li class="active"><a href="#"><span class="none"></span>CSV Output</a></li>
				53	<li class="nav-header">Advanced Features</li>
				54	<li><a href="aql/externaldata.html" title="Accessing External Data"><span class="none"></span>Accessing External Data</a></li>
				55	<li><a href="feeds.html" title="Data Ingestion with Feeds"><span class="none"></span>Data Ingestion with Feeds</a></li>
				56	<li><a href="udf.html" title="User Defined Functions"><span class="none"></span>User Defined Functions</a></li>
				57	<li><a href="sqlpp/filters.html" title="Filter-Based LSM Index Acceleration"><span class="none"></span>Filter-Based LSM Index Acceleration</a></li>
				58	<li><a href="sqlpp/fulltext.html" title="Support of Full-text Queries"><span class="none"></span>Support of Full-text Queries</a></li>
				59	<li><a href="sqlpp/similarity.html" title="Support of Similarity Queries"><span class="none"></span>Support of Similarity Queries</a></li>
				60	<li><a href="geo/quickstart.html" title="GIS Support Overview"><span class="none"></span>GIS Support Overview</a></li>
				61	<li><a href="geo/functions.html" title="GIS Functions"><span class="none"></span>GIS Functions</a></li>
				62	<li><a href="interval_join.html" title="Support of Interval Joins"><span class="none"></span>Support of Interval Joins</a></li>
				63	<li><a href="spatial_join.html" title="Support of Spatial Joins"><span class="none"></span>Support of Spatial Joins</a></li>
				64	<li><a href="sqlpp/arrayindex.html" title="Support of Array Indexes"><span class="none"></span>Support of Array Indexes</a></li>
				65	<li class="nav-header">Deprecated</li>
				66	<li><a href="aql/primer.html" title="AsterixDB Primer: Using AQL"><span class="none"></span>AsterixDB Primer: Using AQL</a></li>
				67	<li><a href="aql/manual.html" title="Queries: The Asterix Query Language (AQL)"><span class="none"></span>Queries: The Asterix Query Language (AQL)</a></li>
				68	<li><a href="aql/builtins.html" title="Queries: Builtin Functions (AQL)"><span class="none"></span>Queries: Builtin Functions (AQL)</a></li>
				69	</ul>
				70	<hr />
				71	<div id="poweredBy">
				72	<div class="clear"></div>
				73	<div class="clear"></div>
				74	<div class="clear"></div>
				75	<div class="clear"></div>
				76	<a href="./" title="AsterixDB" class="builtBy"><img class="builtBy" alt="AsterixDB" src="images/asterixlogo.png" /></a>
				77	</div>
				78	</div>
				79	</div>
				80	<div id="bodyColumn" class="span10" >
				81	<!--
				82	! Licensed to the Apache Software Foundation (ASF) under one
				83	! or more contributor license agreements. See the NOTICE file
				84	! distributed with this work for additional information
				85	! regarding copyright ownership. The ASF licenses this file
				86	! to you under the Apache License, Version 2.0 (the
				87	! "License"); you may not use this file except in compliance
				88	! with the License. You may obtain a copy of the License at
				89	!
				90	! http://www.apache.org/licenses/LICENSE-2.0
				91	!
				92	! Unless required by applicable law or agreed to in writing,
				93	! software distributed under the License is distributed on an
				94	! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
				95	! KIND, either express or implied. See the License for the
				96	! specific language governing permissions and limitations
				97	! under the License.
				98	!-->
				99	<h1>CSV Support in AsterixDB</h1>
				100	<div class="section">
				101	<h2><a name="Introduction_-_Defining_a_datatype_for_CSV"></a>Introduction - Defining a datatype for CSV</h2>
				102	<p>AsterixDB supports the CSV format for both data input and query result output. In both cases, the structure of the CSV data must be defined using a named ADM object datatype. The CSV format, limitations, and MIME type are defined by <a class="externalLink" href="https://tools.ietf.org/html/rfc4180">RFC 4180</a>.</p>
				103	<p>CSV is less expressive than the full Asterix Data Model: not all data that can be represented in ADM can also be represented as CSV. The form of this datatype is therefore limited in two ways. First, it may not contain any nested objects or lists, as CSV has no way to represent nested data structures; all fields in the object type must be primitive. Second, the set of supported primitive types is limited to numerics (<tt>int8</tt>, <tt>int16</tt>, <tt>int32</tt>, <tt>int64</tt>, <tt>float</tt>, <tt>double</tt>) and <tt>string</tt>. On output, a few additional primitive types (<tt>boolean</tt>, datetime types) are supported and will be represented as strings.</p>
				104	<p>For the purposes of this document, we will use the following dataverse and datatype definitions:</p>
				105
				106	<div>
				107	<div>
				108	<pre class="source">drop dataverse csv if exists;
				109	create dataverse csv;
				110	use dataverse csv;
				111
				112	create type "csv_type" as closed {
				113	"id": int32,
				114	"money": float,
				115	"name": string
				116	};
				117
				118	create dataset "csv_set" ("csv_type") primary key "id";
				119	</pre></div></div>
				120
				121	<p>Note: There is no explicit restriction against using an open datatype for CSV purposes, and you may have optional fields in the datatype (e.g., <tt>id: int32?</tt>). However, the CSV format itself is rigid, so using either of these datatype features introduces possible failure modes on output, which are discussed below.</p></div>
				122	<div class="section">
				123	<h2><a name="CSV_Input"></a>CSV Input</h2>
				124	<p>CSV data may be loaded into a dataset using the normal “load dataset” mechanisms, utilizing the builtin “delimited-text” format. See <a href="aql/externaldata.html">Accessing External Data</a> for more details. Note that comma is the default value for the “delimiter” parameter, so it does not need to be explicitly specified.</p>
				125	<p>In this case, the datatype used to interpret the CSV data is the datatype associated with the dataset being loaded. So, to load a file that we have stored locally on the NC into our example dataset:</p>
				126
				127	<div>
				128	<div>
				129	<pre class="source">use dataverse csv;
				130
				131	load dataset "csv_set" using localfs
				132	(("path"="127.0.0.1:///tmp/my_sample.csv"),
				133	("format"="delimited-text"));
				134	</pre></div></div>
				135
				136	<p>So, if the file <tt>/tmp/my_sample.csv</tt> contained</p>
				137
				138	<div>
				139	<div>
				140	<pre class="source">1,18.50,"Peter Krabnitz"
				141	2,74.50,"Jesse Stevens"
				142	</pre></div></div>
				143
				144	<p>then the preceding query would load it into the dataset <tt>csv_set</tt>.</p>
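<p>If the input file is produced by a program, each line has to follow the field order of <tt>csv_type</tt>. The following is a minimal Java sketch of that, using RFC 4180 quoting (the string field is quoted, and embedded double quotes are doubled). The class and method names are hypothetical and not part of AsterixDB, and whether the loader accepts every RFC 4180 construct is an assumption that is not verified here.</p>

```java
// Hypothetical helper (not part of AsterixDB): formats one csv_type record
// (id: int32, money: float, name: string) as a single RFC 4180 line.
final class CsvTypeWriter {

    // The string field is always quoted; an embedded double quote is
    // escaped by doubling it, as RFC 4180 describes.
    static String toCsvLine(int id, float money, String name) {
        String quotedName = "\"" + name.replace("\"", "\"\"") + "\"";
        return id + "," + money + "," + quotedName;
    }

    public static void main(String[] args) {
        // Prints the two sample records in the field order of csv_type.
        System.out.println(toCsvLine(1, 18.5f, "Peter Krabnitz"));
        System.out.println(toCsvLine(2, 74.5f, "Jesse Stevens"));
    }
}
```

<p>Writing the lines that <tt>main</tt> prints to a file such as <tt>/tmp/my_sample.csv</tt> yields input of the same shape as the sample above.</p>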
				145	<p>If your CSV file has a header (that is, the first line contains a set of field names, rather than actual data), you can instruct Asterix to ignore this header by adding the parameter <tt>"header"="true"</tt>, e.g.:</p>
				146
				147	<div>
				148	<div>
				149	<pre class="source">load dataset "csv_set" using localfs
				150	(("path"="127.0.0.1:///tmp/my_header_sample.csv"),
				151	("format"="delimited-text"),
				152	("header"="true"));
				153	</pre></div></div>
				154
				155	<p>CSV data may also be loaded from HDFS; see <a href="aql/externaldata.html">Accessing External Data</a> for details. Note, however, that CSV files on HDFS cannot have headers. Specifying <tt>"header"="true"</tt> when reading from HDFS could cause non-header lines of data to be skipped as well.</p></div>
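<p>The two <tt>localfs</tt> load statements shown in this section differ only in the <tt>"header"</tt> parameter. As an illustration, a client that generates such statements could assemble them as in the sketch below. The class and method names are hypothetical, and the sketch covers only the <tt>localfs</tt> form with the parameters shown on this page.</p>

```java
// Hypothetical helper (not part of AsterixDB): builds the localfs
// "load dataset" statement used in this section as a string.
final class LoadStatementBuilder {

    static String loadStatement(String dataset, String path, boolean hasHeader) {
        StringBuilder sb = new StringBuilder();
        sb.append("load dataset \"").append(dataset).append("\" using localfs\n");
        sb.append("((\"path\"=\"").append(path).append("\"),\n");
        sb.append("(\"format\"=\"delimited-text\")");
        if (hasHeader) {
            // Ask the loader to skip the first line of the file.
            sb.append(",\n(\"header\"=\"true\")");
        }
        sb.append(");");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(loadStatement(
                "csv_set", "127.0.0.1:///tmp/my_header_sample.csv", true));
    }
}
```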
				156	<div class="section">
				157	<h2><a name="CSV_Output"></a>CSV Output</h2>
				158	<p>Any query may be rendered as CSV when using AsterixDB’s HTTP interface. To do so, there are two steps required: specify the object type which defines the schema of your CSV, and request that Asterix use the CSV output format.</p>
				159	<div class="section">
				160	<div class="section">
				161	<h4><a name="Output_Object_Type"></a>Output Object Type</h4>
				162	<p>Background: The result of any AQL query is an unordered list of <i>instances</i>, where each <i>instance</i> is an instance of an AQL datatype. When requesting CSV output, there are some restrictions on the legal datatypes in this unordered list due to the limited expressiveness of CSV:</p>
				163	<ol style="list-style-type: decimal">
				164
				165	<li>Each instance must be of an object type.</li>
				166	<li>Each instance must be of the <i>same</i> object type.</li>
				167	<li>The object type must conform to the content and type restrictions mentioned in the introduction.</li>
				168	</ol>
				169	<p>While it would be possible to structure your query to cast all result instances to a given type, it is not necessary. AQL offers a built-in feature which will automatically cast all top-level instances in the result to a specified named ADM object type. To enable this feature, use a <tt>set</tt> statement prior to the query to set the parameter <tt>output-record-type</tt> to the name of an ADM type. This type must have already been defined in the current dataverse.</p>
				170	<p>For example, the following request will ensure that all result instances are cast to the <tt>csv_type</tt> type declared earlier:</p>
				171
				172	<div>
				173	<div>
				174	<pre class="source">use dataverse csv;
				175	set output-record-type "csv_type";
				176
				177	for $n in dataset "csv_set" return $n;
				178	</pre></div></div>
				179
				180	<p>In this case the casting is redundant since by definition every value in <tt>csv_set</tt> is already of type <tt>csv_type</tt>. But consider a more complex query where the result values are created by joining fields from different underlying datasets, etc.</p>
				181	<p>Two notes about <tt>output-record-type</tt>:</p>
				182	<ol style="list-style-type: decimal">
				183
				184	<li>This feature is not strictly related to CSV; it may be used with any output formats (in which case, any object datatype may be specified, not subject to the limitations specified in the introduction of this page).</li>
				185	<li>When the CSV output format is requested, <tt>output-record-type</tt> is in fact required, not optional. This is because the type is used to determine the field names for the CSV header and to ensure that the ordering of fields in the output is consistent (which is obviously vital for the CSV to make any sense).</li>
				186	</ol></div>
				187	<div class="section">
				188	<h4><a name="Request_the_CSV_Output_Format"></a>Request the CSV Output Format</h4>
				189	<p>When sending requests to the Asterix HTTP API, Asterix decides what format to use for rendering the results in one of two ways:</p>
				190	<ul>
				191
				192	<li>
				193
				194	<p>Based on an HTTP query parameter named “output”, which must be set to one of the following values: <tt>JSON</tt>, <tt>CSV</tt>, or <tt>ADM</tt>.</p>
				195	</li>
				196	<li>
				197
				198	<p>Based on the <a class="externalLink" href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.1"><tt>Accept</tt> HTTP header</a></p>
				199	</li>
				200	</ul>
				201	<p>By default, Asterix will produce JSON output. To select CSV output, pass the parameter <tt>output=CSV</tt>, or set the <tt>Accept</tt> header on your request to the MIME type <tt>text/csv</tt>. The details of how to accomplish this will depend on the tools you use to contact the HTTP API. Here is an example from a Unix shell prompt using the command-line utility “curl” and specifying the “output” query parameter:</p>
				202
				203	<div>
				204	<div>
				205	<pre class="source">curl -G "http://localhost:19002/query" \
				206	--data-urlencode 'output=CSV' \
				207	--data-urlencode 'query=use dataverse csv;
				208	set output-record-type "csv_type";
				209	for $n in dataset csv_set return $n;'
				210	</pre></div></div>
				211
				212	<p>Alternately, the same query using the <tt>Accept</tt> header:</p>
				213
				214	<div>
				215	<div>
				216	<pre class="source">curl -G -H "Accept: text/csv" "http://localhost:19002/query" \
				217	--data-urlencode 'query=use dataverse csv;
				218	set output-record-type "csv_type";
				219	for $n in dataset csv_set return $n;'
				220	</pre></div></div>
				221
				222	<p>Similarly, a trivial Java program that executes the above sample query and selects CSV output via the <tt>Accept</tt> header would be:</p>
				223
				224	<div>
				225	<div>
				226	<pre class="source">import java.net.HttpURLConnection;
				227	import java.net.URL;
				228	import java.net.URLEncoder;
				229	import java.io.BufferedReader;
				230	import java.io.InputStream;
				231	import java.io.InputStreamReader;
				232
				233	public class AsterixExample {
				234	public static void main(String[] args) throws Exception {
				235	String query = "use dataverse csv; " +
				236	"set output-record-type \"csv_type\";" +
				237	"for $n in dataset csv_set return $n";
				238	URL asterix = new URL("http://localhost:19002/query?query=" +
				239	URLEncoder.encode(query, "UTF-8"));
				240	HttpURLConnection conn = (HttpURLConnection) asterix.openConnection();
				241	conn.setRequestProperty("Accept", "text/csv");
				242	BufferedReader result = new BufferedReader
				243	(new InputStreamReader(conn.getInputStream()));
				244	String line;
				245	while ((line = result.readLine()) != null) {
				246	System.out.println(line);
				247	}
				248	result.close();
				249	}
				250	}
				251	</pre></div></div>
				252
				253	<p>For any of the above examples, the output would be:</p>
				254
				255	<div>
				256	<div>
				257	<pre class="source">1,18.5,"Peter Krabnitz"
				258	2,74.5,"Jesse Stevens"
				259	</pre></div></div>
				260
				261	<p>assuming you had already run the previous examples to create the dataverse and populate the dataset.</p></div>
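<p>The Java program above selects CSV through the <tt>Accept</tt> header. The same request can instead carry the <tt>output=CSV</tt> query parameter, as the first curl example does. The following sketch only builds such a request URL. The class and method names are hypothetical, and the endpoint is the one used in the examples on this page.</p>

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper (not part of AsterixDB): builds a request URL that
// selects CSV output through the "output" query parameter.
final class CsvOutputUrl {

    static String build(String endpoint, String query) {
        try {
            // URL-encode the query text, as the Java example above does.
            return endpoint + "?output=CSV&query=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset that every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String query = "use dataverse csv; "
                + "set output-record-type \"csv_type\"; "
                + "for $n in dataset csv_set return $n;";
        System.out.println(build("http://localhost:19002/query", query));
    }
}
```

<p>The resulting string can be passed to <tt>new URL(...)</tt> in place of the URL built in the program above, in which case the <tt>Accept</tt> header no longer needs to be set.</p>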
				262	<div class="section">
				263	<h4><a name="Outputting_CSV_with_a_Header"></a>Outputting CSV with a Header</h4>
				264	<p>By default, AsterixDB will produce CSV results with no header line. If you want a header, you may explicitly request it in one of two ways:</p>
				265	<ul>
				266
				267	<li>
				268
				269	<p>By passing the HTTP query parameter “header” with the value “present”</p>
				270	</li>
				271	<li>
				272
				273	<p>By specifying the MIME type <tt>text/csv; header=present</tt> in your HTTP <tt>Accept</tt> header. This is consistent with RFC 4180.</p>
				274	</li>
				275	</ul></div>
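<p>Either way of requesting a header amounts to a small change to the request. The following sketch shows both as plain string helpers. The class and method names are hypothetical; only the parameter name <tt>header</tt>, the value <tt>present</tt>, and the MIME type come from the list above.</p>

```java
// Hypothetical helper (not part of AsterixDB): the two ways of asking for
// a CSV header line, expressed as string manipulations on a request.
final class CsvHeaderRequest {

    // Value for the HTTP Accept header, with or without a header line.
    static String acceptValue(boolean withHeader) {
        return withHeader ? "text/csv; header=present" : "text/csv";
    }

    // Appends the "header=present" query parameter to a request URL.
    static String withHeaderParam(String url) {
        String separator = url.contains("?") ? "&" : "?";
        return url + separator + "header=present";
    }

    public static void main(String[] args) {
        System.out.println(acceptValue(true));
        System.out.println(withHeaderParam("http://localhost:19002/query?output=CSV"));
    }
}
```

<p>With <tt>HttpURLConnection</tt>, the first value would be passed to <tt>setRequestProperty("Accept", ...)</tt>, as in the Java example earlier on this page.</p>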
				276	<div class="section">
				277	<h4><a name="Issues_with_open_datatypes_and_optional_fields"></a>Issues with open datatypes and optional fields</h4>
				278	<p>As mentioned earlier, CSV is a rigid format. It cannot express objects with different numbers of fields, which ADM allows through both open datatypes and optional fields.</p>
				279	<p>If your output object type contains optional fields, this will not result in any errors. If the output data of a query does not contain values for an optional field, this will be represented in CSV as <tt>null</tt>.</p>
				280	<p>If your output object type is open, this will also not result in any errors. If the output data of a query contains any open fields, the corresponding rows in the resulting CSV will contain more comma-separated values than the others. On each such row, the data from the closed fields in the type will be output first in the normal order, followed by the data from the open fields in an arbitrary order.</p>
				281	<p>According to RFC 4180 this is not strictly valid CSV (Section 2, rule 4, “Each line <i>should</i> contain the same number of fields throughout the file”). Hence it will likely not be handled consistently by all CSV processors, and some may throw a parsing error. If you attempt to load this data into AsterixDB later using <tt>load dataset</tt>, the extra fields will be silently ignored. For this reason it is recommended that you use only closed datatypes as output object types. AsterixDB allows the use of an open object type only to support cases where the type already exists for other parts of your application.</p></div></div></div>
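<p>A consumer of CSV output produced with an open type may want to detect rows that carry extra fields before handing the data to a stricter CSV processor. The following is a minimal sketch of such a check. The class and method names are hypothetical, and it assumes that no quoted field contains a line break, so that each line is one row.</p>

```java
// Hypothetical helper (not part of AsterixDB): checks RFC 4180 rule 4,
// i.e. that every line has the same number of fields.
// Assumption: no quoted field contains a line break.
final class CsvShapeCheck {

    // Counts comma-separated fields, ignoring commas inside quoted fields.
    static int countFields(String line) {
        int fields = 1;
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                // A doubled quote ("") toggles twice, leaving the state unchanged.
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                fields++;
            }
        }
        return fields;
    }

    // Returns true if all lines have the same field count as the first line.
    static boolean isRectangular(String[] lines) {
        for (String line : lines) {
            if (countFields(line) != countFields(lines[0])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] rows = {
            "1,18.5,\"Peter Krabnitz\"",
            "2,74.5,\"Jesse Stevens\",\"extra open field\""
        };
        System.out.println(isRectangular(rows)); // false: the second row has 4 fields
    }
}
```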
				282	</div>
				283	</div>
				284	</div>
				285	<hr/>
				286	<footer>
				287	<div class="container-fluid">
				288	<div class="row-fluid">
				289	<div class="row-fluid">Apache AsterixDB, AsterixDB, Apache, the Apache
				290	feather logo, and the Apache AsterixDB project logo are either
				291	registered trademarks or trademarks of The Apache Software
				292	Foundation in the United States and other countries.
				293	All other marks mentioned may be trademarks or registered
				294	trademarks of their respective owners.
				295	</div>
				296	</div>
				297	</div>
				298	</footer>
				299	</body>
				300	</html>