Important
This feature is in Public Preview.
Important
To access Databricks REST APIs, you must authenticate.
This tutorial shows how to use the Databricks SQL Statement Execution API 2.0 to run SQL statements from Databricks SQL warehouses.
To view and to download an OpenAPI Specification-compatible version of the Databricks SQL Statement Execution API 2.0 reference, see Statement Execution API 2.0.
Before you begin
Before you start this tutorial, be sure that you have:
- The workspace instance name, for example `adb-1234567890123456.7.azuredatabricks.net`, for your Azure Databricks workspace. This tutorial assumes that you have an environment variable on your local development machine named `DATABRICKS_HOST`, which is set to this value. To learn how to set environment variables, see your operating system's documentation.

  Warning

  Databricks strongly discourages hard-coding sensitive information into your scripts, as it can be exposed in plain text through version control systems. Databricks recommends that you instead use approaches such as environment variables that you set on your development machine. Removing such hard-coded information from your scripts also helps make them more portable.
- An Azure Databricks personal access token or Azure Active Directory (Azure AD) token for your Azure Databricks workspace user. This tutorial assumes that you have an environment variable on your local development machine named `DATABRICKS_TOKEN`, which is set to the value of your token.

  Note

  As a security best practice, when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use personal access tokens belonging to service principals instead of workspace users. To create tokens for service principals, see Manage personal access tokens for a service principal.
- At least one table that you can execute SQL statements against. This tutorial is based on the `lineitem` table in the `tpch` schema (also known as a database) within the `samples` catalog. If you do not have access to this catalog, schema, or table from your workspace, substitute your own throughout this tutorial.
- curl, a command-line tool for sending and receiving REST API requests and responses. See also Install curl. Alternatively, you can adapt this tutorial's examples for use with similar tools such as Postman or HTTPie.
For each of the `curl` examples used in this article:

- Instead of `--header "Authorization: Bearer ${DATABRICKS_TOKEN}"`, you can use a `.netrc` file. If you use a `.netrc` file, replace `--header "Authorization: Bearer ${DATABRICKS_TOKEN}"` with `--netrc`.
- If you use the Windows Command shell instead of a command shell for Unix, Linux, or macOS, replace `\` with `^`, and replace `${...}` with `%...%`.
- If you use the Windows Command shell instead of a command shell for Unix, Linux, or macOS, in JSON document declarations, replace the opening and closing `'` with `"`, and replace inner `"` with `\"`.
- jq, a command-line processor for querying the JSON response payloads that the Databricks SQL Statement Execution API returns after each call. See also Download jq.
- A Databricks SQL warehouse. This tutorial assumes that you have an environment variable on your local development machine named `DATABRICKS_SQL_WAREHOUSE_ID`, which is set to the string of letters and numbers following `/sql/1.0/warehouses/` in the HTTP path field for your warehouse. To learn how to get your warehouse's HTTP path value, see Get connection details for a SQL warehouse.

  The Databricks SQL warehouse must be set to use the Preview channel in Advanced options. Note that switching a warehouse from the Current channel to the Preview channel causes the warehouse to restart automatically. This upgrade could take several minutes.
Step 1: Execute a SQL statement and save the data result as JSON
Run the following command, which does the following:
- Uses the specified SQL warehouse, along with the specified token, to query for three columns from the first two rows of the `lineitem` table in the `tpch` schema within the `samples` catalog.
- Saves the response payload in JSON format in a file named `sql-execution-response.json` within the current working directory.
- Prints the contents of the `sql-execution-response.json` file.
- Sets a local environment variable named `SQL_STATEMENT_ID` that contains the ID of the corresponding SQL statement. You can use this SQL statement ID to get information about that statement later as needed, as demonstrated in Step 2. You can also view this SQL statement and get its statement ID from the query history section of the Databricks SQL console, or by calling the Query History API 2.0.
- Sets an additional local environment variable named `NEXT_CHUNK_INTERNAL_LINK` that contains an API URL fragment for getting the next chunk of JSON data. If the response data is too large, the Databricks SQL Statement Execution API provides the response in chunks. You can use this API URL fragment to get the next chunk of data, as demonstrated in Step 2. If there is no next chunk, this environment variable is set to `null`.
- Prints the values of the `SQL_STATEMENT_ID` and `NEXT_CHUNK_INTERNAL_LINK` environment variables.
```bash
curl --request POST \
https://${DATABRICKS_HOST}/api/2.0/sql/statements/ \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
--header "Content-Type: application/json" \
--data '{
  "warehouse_id": "'"$DATABRICKS_SQL_WAREHOUSE_ID"'",
  "catalog": "samples",
  "schema": "tpch",
  "statement": "SELECT l_orderkey, l_extendedprice, l_shipdate FROM lineitem LIMIT 2"
}' \
--output 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json' \
&& export SQL_STATEMENT_ID=$(jq -r .statement_id 'sql-execution-response.json') \
&& export NEXT_CHUNK_INTERNAL_LINK=$(jq -r .result.next_chunk_internal_link 'sql-execution-response.json') \
&& echo SQL_STATEMENT_ID=$SQL_STATEMENT_ID \
&& echo NEXT_CHUNK_INTERNAL_LINK=$NEXT_CHUNK_INTERNAL_LINK
```
In the preceding request:
- By default, any returned data is in JSON array format, and the default location for any of the SQL statement's data results is within the response payload. To make this behavior explicit, add `"format":"JSON_ARRAY","disposition":"INLINE"` to the request payload. If you attempt to return data results larger than 16 MiB in the response payload, a failure status is returned and the SQL statement is canceled. For data results larger than 16 MiB, you can use external links instead of trying to return them in the response payload, as demonstrated in Step 3.
- `curl` stores the response payload's contents to a local file. The Databricks SQL Statement Execution API does not support local data storage directly.
- By default, after 10 seconds, if the SQL statement has not yet finished executing through the warehouse, the Databricks SQL Statement Execution API returns only the SQL statement ID and its current status, instead of the statement's result. To change this behavior, add `"wait_timeout":"<x>s"` to the request payload, where `<x>` can be between `5` and `50` seconds inclusive, for example `"wait_timeout":"50s"`. To return the SQL statement ID and its current status immediately, set `wait_timeout` to `0s`.
- By default, the SQL statement continues to run if the timeout period is reached. To cancel a SQL statement when the timeout period is reached instead, add `"on_wait_timeout":"CANCEL"` to the request payload.
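If you build these requests from a script rather than a shell, the options above can be assembled and validated before sending. The following Python sketch is a hypothetical helper; only the payload keys (`warehouse_id`, `catalog`, `schema`, `statement`, `wait_timeout`, `on_wait_timeout`) come from the API:

```python
import json

# Sketch: build the request body for POST /api/2.0/sql/statements/,
# enforcing the documented wait_timeout range (0s, or 5s-50s inclusive).
# The helper name and defaults are illustrative.

def build_statement_payload(warehouse_id, statement, catalog="samples",
                            schema="tpch", wait_timeout_s=10,
                            cancel_on_timeout=False):
    if wait_timeout_s != 0 and not 5 <= wait_timeout_s <= 50:
        raise ValueError("wait_timeout must be 0s, or between 5s and 50s")
    payload = {
        "warehouse_id": warehouse_id,
        "catalog": catalog,
        "schema": schema,
        "statement": statement,
        "wait_timeout": f"{wait_timeout_s}s",
    }
    if cancel_on_timeout:
        # Cancel the statement instead of letting it keep running.
        payload["on_wait_timeout"] = "CANCEL"
    return json.dumps(payload)

body = build_statement_payload(
    "abc123", "SELECT l_orderkey FROM lineitem LIMIT 2", wait_timeout_s=50)
print(body)
```

The returned string can be sent as the request body with any HTTP client, exactly as the `--data` argument is sent with `curl`.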
If the statement’s result is available before the wait timeout ends, the response is as follows:
```json
{
  "statement_id": "01ed92c5-3583-1f38-b21b-c6773e7c56b3",
  "status": {
    "state": "SUCCEEDED"
  },
  "manifest": {
    "format": "JSON_ARRAY",
    "schema": {
      "column_count": 3,
      "columns": [
        {
          "name": "l_orderkey",
          "type_name": "LONG",
          "position": 0
        },
        {
          "name": "l_extendedprice",
          "type_name": "DECIMAL",
          "position": 1,
          "type_precision": 18,
          "type_scale": 2
        },
        {
          "name": "l_shipdate",
          "type_name": "DATE",
          "position": 2
        }
      ]
    }
  },
  "result": {
    "chunk_index": 0,
    "row_offset": 0,
    "row_count": 2,
    "data_array": [
      [
        "15997987",
        "66516.00",
        "1992-02-12"
      ],
      [
        "15997988",
        "53460.96",
        "1994-05-31"
      ]
    ]
  }
}
```
If the wait timeout ends before the statement’s result is available, the response looks like this instead:
```json
{
  "statement_id": "01ed92c5-3583-1f38-b21b-c6773e7c56b3",
  "status": {
    "state": "PENDING"
  }
}
```
If the statement's result data is too large (for example, by running `SELECT l_orderkey, l_extendedprice, l_shipdate FROM lineitem LIMIT 3000000` in this case), the result data is chunked and looks like this instead. Note that `"...": "..."` indicates omitted results here for brevity:
```json
{
  "statement_id": "01ed92c5-3583-1f38-b21b-c6773e7c56b3",
  "status": {
    "state": "SUCCEEDED"
  },
  "manifest": {
    "format": "JSON_ARRAY",
    "schema": {
      "column_count": 3,
      "columns": [
        {
          "...": "..."
        }
      ]
    }
  },
  "result": {
    "chunk_index": 0,
    "row_offset": 0,
    "row_count": 432500,
    "next_chunk_index": 1,
    "next_chunk_internal_link": "/api/2.0/sql/statements/01ed92c5-3583-1f38-b21b-c6773e7c56b3/result/chunks/1?row_offset=432500",
    "data_array": [
      [
        "15997987",
        "66516.00",
        "1992-02-12"
      ],
      [
        "..."
      ]
    ]
  }
}
```
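Once an inline `JSON_ARRAY` result arrives, the column names in the manifest's schema line up positionally with each row of `data_array`. The following Python sketch (hypothetical helper, not part of the API) zips them together so rows can be handled as dictionaries:

```python
# Sketch: pair manifest column names with data_array rows.
# All values arrive as strings; real code would also convert them
# according to each column's type_name (LONG, DECIMAL, DATE, ...).

def rows_as_dicts(response):
    names = [c["name"] for c in response["manifest"]["schema"]["columns"]]
    return [dict(zip(names, row)) for row in response["result"]["data_array"]]

# Minimal stand-in for an API response:
response = {
    "manifest": {"schema": {"columns": [
        {"name": "l_orderkey"},
        {"name": "l_extendedprice"},
        {"name": "l_shipdate"},
    ]}},
    "result": {"data_array": [["15997987", "66516.00", "1992-02-12"]]},
}
print(rows_as_dicts(response))
```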
Step 2: Get a statement’s current execution status and data result as JSON
You can use a SQL statement’s ID to get that statement’s current execution status and, if the execution succeeded, that statement’s result. If you forget the statement’s ID, you can get it from the query history section of the Databricks SQL console, or by calling the Query History API 2.0. For example, you could keep polling this command, checking each time to see if the execution has succeeded.
To get a SQL statement's current execution status and, if the execution succeeded, that statement's result and an API URL fragment for getting any next chunk of JSON data, run the following command. This command assumes that you have an environment variable on your local development machine named `SQL_STATEMENT_ID`, which is set to the value of the ID of the SQL statement from the previous step or otherwise provided. Of course, you can substitute `${SQL_STATEMENT_ID}` in the following command with the hard-coded ID of the SQL statement.
```bash
curl --request GET \
https://${DATABRICKS_HOST}/api/2.0/sql/statements/${SQL_STATEMENT_ID} \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
--output 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json' \
&& export NEXT_CHUNK_INTERNAL_LINK=$(jq -r .result.next_chunk_internal_link 'sql-execution-response.json') \
&& echo NEXT_CHUNK_INTERNAL_LINK=$NEXT_CHUNK_INTERNAL_LINK
```
If `NEXT_CHUNK_INTERNAL_LINK` is set to a non-`null` value, you can use it to get the next chunk of data, and so on, for example with the following command:
```bash
curl --request GET \
https://${DATABRICKS_HOST}${NEXT_CHUNK_INTERNAL_LINK} \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
--output 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json' \
&& export NEXT_CHUNK_INTERNAL_LINK=$(jq -r .next_chunk_internal_link 'sql-execution-response.json') \
&& echo NEXT_CHUNK_INTERNAL_LINK=$NEXT_CHUNK_INTERNAL_LINK
```
You can keep running the preceding command, over and over again, to get the next chunk and so on. Note that as soon as the last chunk is fetched, the SQL statement is closed. After this closure, you cannot use that statement’s ID to get its current status or to fetch any more chunks.
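The repeated fetching above amounts to following `next_chunk_internal_link` until it is `null`. The following Python sketch shows that loop; `fetch_json` stands in for an HTTP GET of `https://${DATABRICKS_HOST}<path>` with the `Authorization` header, and is injected here so the pagination logic can be demonstrated offline:

```python
# Sketch: follow next_chunk_internal_link values until they are null,
# collecting each chunk's rows. fetch_json(path) stands in for an
# authenticated HTTP GET against the workspace; the names are illustrative.

def collect_all_rows(first_result, fetch_json):
    rows = list(first_result.get("data_array", []))
    next_link = first_result.get("next_chunk_internal_link")
    while next_link:  # null/None means there are no more chunks
        chunk = fetch_json(next_link)
        rows.extend(chunk.get("data_array", []))
        next_link = chunk.get("next_chunk_internal_link")
    return rows

# Offline demonstration with two stub chunks:
chunks = {
    "/chunk/1": {"data_array": [["2"]], "next_chunk_internal_link": "/chunk/2"},
    "/chunk/2": {"data_array": [["3"]], "next_chunk_internal_link": None},
}
first = {"data_array": [["1"]], "next_chunk_internal_link": "/chunk/1"}
print(collect_all_rows(first, chunks.__getitem__))
# → [['1'], ['2'], ['3']]
```

Because the statement is closed once the last chunk is fetched, collect everything you need in one pass.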
Step 3: Fetch large results using external links
This section demonstrates an optional configuration that uses the `EXTERNAL_LINKS` disposition to retrieve large data sets. The default location (disposition) for the SQL statement result data is within the response payload, but these results are limited to 16 MiB. By setting the `disposition` to `EXTERNAL_LINKS`, the response contains URLs that you can use to fetch the chunks of the results data with standard HTTP. The URLs point to your workspace's internal DBFS, where the result chunks are temporarily stored.
Warning
Databricks strongly recommends that you protect the URLs and tokens that are returned by the `EXTERNAL_LINKS` disposition.

When you use the `EXTERNAL_LINKS` disposition, a shared access signature (SAS) URL is generated, which can be used to download the results directly from Azure storage. As a short-lived SAS token is embedded within this SAS URL, you should protect both the SAS URL and the SAS token.

Because SAS URLs are already generated with embedded temporary SAS tokens, you must not set an `Authorization` header in the download requests.

The `EXTERNAL_LINKS` disposition can be disabled upon request. To make this request, create a support case. See _.
See also Security best practices.
Note
The response payload output format and behavior, once they are set for a particular SQL statement ID, cannot be changed.
In this mode, the API stores the result data only in Apache Arrow format, and the data must be fetched separately with HTTP. Also, when using this mode, it is not possible to inline the result data within the response payload.
The following command demonstrates using `EXTERNAL_LINKS`. Use this pattern instead of the similar query demonstrated in Step 1:
```bash
curl --request POST \
https://${DATABRICKS_HOST}/api/2.0/sql/statements/ \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
--header "Content-Type: application/json" \
--data '{
  "warehouse_id": "'"$DATABRICKS_SQL_WAREHOUSE_ID"'",
  "catalog": "samples",
  "schema": "tpch",
  "format": "ARROW_STREAM",
  "disposition": "EXTERNAL_LINKS",
  "statement": "SELECT l_orderkey, l_extendedprice, l_shipdate FROM lineitem LIMIT 100000"
}' \
--output 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json' \
&& export SQL_STATEMENT_ID=$(jq -r .statement_id 'sql-execution-response.json') \
&& echo SQL_STATEMENT_ID=$SQL_STATEMENT_ID
```
The response is as follows:
```json
{
  "statement_id": "01ed92c5-3583-1f38-b21b-c6773e7c56b3",
  "status": {
    "state": "SUCCEEDED"
  },
  "manifest": {
    "format": "ARROW_STREAM",
    "schema": {
      "column_count": 3,
      "columns": [
        {
          "name": "l_orderkey",
          "type_name": "LONG",
          "position": 0
        },
        {
          "name": "l_extendedprice",
          "type_name": "DECIMAL",
          "position": 1,
          "type_precision": 18,
          "type_scale": 2
        },
        {
          "name": "l_shipdate",
          "type_name": "DATE",
          "position": 2
        }
      ]
    },
    "total_chunk_count": 1,
    "chunks": [
      {
        "chunk_index": 0,
        "row_offset": 0,
        "row_count": 100000
      }
    ],
    "total_row_count": 100000,
    "total_byte_count": 2848312
  },
  "result": {
    "external_links": [
      {
        "chunk_index": 0,
        "row_offset": 0,
        "row_count": 100000,
        "external_link": "<url-to-data-stored-externally>",
        "expiration": "<url-expiration-timestamp>"
      }
    ]
  }
}
```
If the request times out, the response looks like this instead:
```json
{
  "statement_id": "01ed92c5-3583-1f38-b21b-c6773e7c56b3",
  "status": {
    "state": "PENDING"
  }
}
```
To get that statement’s current execution status and, if the execution succeeded, that statement’s result, run the following command:
```bash
curl --request GET \
https://${DATABRICKS_HOST}/api/2.0/sql/statements/${SQL_STATEMENT_ID} \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
--output 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json'
```
If the response is large enough (for example, by running `SELECT l_orderkey, l_extendedprice, l_shipdate FROM lineitem` with no row limit), the response has multiple chunks, as in the following example. Note that `"...": "..."` indicates omitted results here for brevity:
```json
{
  "statement_id": "01ed92c5-3583-1f38-b21b-c6773e7c56b3",
  "status": {
    "state": "SUCCEEDED"
  },
  "manifest": {
    "format": "ARROW_STREAM",
    "schema": {
      "column_count": 3,
      "columns": [
        {
          "...": "..."
        }
      ]
    },
    "total_chunk_count": 43,
    "chunks": [
      {
        "chunk_index": 0,
        "row_offset": 0,
        "row_count": 737500
      },
      {
        "chunk_index": 1,
        "row_offset": 737500,
        "row_count": 737500
      },
      {
        "...": "..."
      },
      {
        "chunk_index": 41,
        "row_offset": 28755596,
        "row_count": 737500
      },
      {
        "chunk_index": 42,
        "row_offset": 29493096,
        "row_count": 506699
      }
    ],
    "total_row_count": 29999795,
    "total_byte_count": 854409376
  },
  "result": {
    "external_links": [
      {
        "chunk_index": 0,
        "row_offset": 0,
        "row_count": 737500,
        "next_chunk_index": 1,
        "next_chunk_internal_link": "/api/2.0/sql/statements/01ed92c5-3583-1f38-b21b-c6773e7c56b3/result/chunks/1?row_offset=737500",
        "external_link": "<url-to-result-stored-externally>",
        "expiration": "<expiration-timestamp-for-url>"
      }
    ]
  }
}
```
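Because only `row_offset` values listed in the manifest are valid, a chunk-fetch URL should be built from the manifest rather than computed. The following Python sketch (hypothetical helper; the path shape matches the API's `/result/chunks/<chunk_index>?row_offset=<row_offset>` endpoint) looks up a chunk and builds its fetch path:

```python
# Sketch: look up a chunk's row_offset in the response manifest and build
# the path for fetching that chunk. The helper name is illustrative.

def chunk_fetch_path(statement_id, manifest, chunk_index):
    for chunk in manifest["chunks"]:
        if chunk["chunk_index"] == chunk_index:
            return (f"/api/2.0/sql/statements/{statement_id}"
                    f"/result/chunks/{chunk_index}"
                    f"?row_offset={chunk['row_offset']}")
    raise ValueError(f"chunk_index {chunk_index} not in manifest")

# Stand-in manifest with the first two chunks from a response:
manifest = {"chunks": [
    {"chunk_index": 0, "row_offset": 0, "row_count": 737500},
    {"chunk_index": 1, "row_offset": 737500, "row_count": 737500},
]}
print(chunk_fetch_path("01ed92c5-3583-1f38-b21b-c6773e7c56b3", manifest, 1))
# → /api/2.0/sql/statements/01ed92c5-3583-1f38-b21b-c6773e7c56b3/result/chunks/1?row_offset=737500
```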
To download the stored content's results, you can run the following `curl` command, using the URL in the `external_link` object and specifying where you want to download the file. Do not include your Azure Databricks token in this command:
```bash
curl "<url-to-result-stored-externally>" \
--output "<path/to/download/the/file/locally>"
```
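If your script reuses one set of HTTP headers for all calls, strip the Databricks `Authorization` header before downloading from an external link, since the SAS token embedded in the URL already authorizes the request. A minimal Python sketch (the helper name is illustrative):

```python
# Sketch: build headers for downloading an external link. Unlike calls to
# the Statement Execution API itself, the download request must NOT carry
# the Databricks Authorization header; the SAS token in the URL handles
# authorization, and sending credentials to storage risks exposing them.

def download_headers(api_headers):
    headers = dict(api_headers)
    headers.pop("Authorization", None)  # never forward Databricks credentials
    return headers

api_headers = {"Authorization": "Bearer <token>", "Accept": "*/*"}
print(download_headers(api_headers))
# → {'Accept': '*/*'}
```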
To download a specific chunk of a streamed content's results, first use the `chunk_index` and `row_offset` values from the response payload. For example, to get the chunk with a `chunk_index` of `41` and a `row_offset` of `28755596` from the previous response, run the following command:
```bash
curl --request GET \
https://${DATABRICKS_HOST}/api/2.0/sql/statements/${SQL_STATEMENT_ID}/result/chunks/41?row_offset=28755596 \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
--output 'sql-execution-response.json' \
&& jq . 'sql-execution-response.json'
```
Note
Running the preceding command returns a new SAS URL.
Arbitrary `row_offset` values cannot be specified. You can only use `row_offset` values that are contained within the `manifest` object.

To download the stored chunk, use the URL in the `external_link` object.
For more information about the Apache Arrow format, see the Apache Arrow documentation.
Step 4: Cancel a SQL statement’s execution
If you need to cancel a SQL statement that has not yet succeeded, run the following command:
```bash
curl --request POST \
https://${DATABRICKS_HOST}/api/2.0/sql/statements/${SQL_STATEMENT_ID}/cancel \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}"
```
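A common pattern is to poll a statement's status (Step 2) a bounded number of times and issue this cancel call if it never reaches a terminal state. The following Python sketch shows the control flow; `get_status` and `cancel` stand in for the GET status and POST cancel calls, and the terminal-state list reflects the states shown in this tutorial plus the API's other terminal states (FAILED, CANCELED, CLOSED):

```python
# Sketch: poll a statement's state up to max_polls times, then cancel it
# if it is still running. get_status/cancel are injected stand-ins for the
# HTTP calls; a real script would also time.sleep() between polls and poll
# again afterward to confirm the statement reached CANCELED.

def poll_or_cancel(get_status, cancel, max_polls):
    terminal = ("SUCCEEDED", "FAILED", "CANCELED", "CLOSED")
    for _ in range(max_polls):
        state = get_status()
        if state in terminal:
            return state
    cancel()
    return "CANCELED"

# Offline demonstration: three polls, none terminal, so cancel fires.
states = iter(["PENDING", "RUNNING", "RUNNING"])
result = poll_or_cancel(lambda: next(states), lambda: None, max_polls=3)
print(result)
# → CANCELED
```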
Security best practices
The Databricks SQL Statement Execution API increases the security of data transfers by using end-to-end transport layer security (TLS) encryption and short-lived credentials such as SAS tokens.
There are several layers in this security model. At the transport layer, it is only possible to communicate with the Databricks SQL Statement Execution API by using TLS 1.2 or above. Also, callers of the Databricks SQL Statement Execution API must be authenticated with a valid Azure Databricks personal access token or Azure Active Directory (Azure AD) token that maps to a user who has the entitlement to use Databricks SQL. This user must have Can Use access for the specific SQL warehouse that is being used, and access can be restricted with IP access lists. This applies to all requests to the Databricks SQL Statement Execution API. Furthermore, for executing statements, the authenticated user must have permission to the data objects (such as tables, views, and functions) that are used in each statement. This is enforced by existing access control mechanisms in Unity Catalog or by using table ACLs; see the Data governance guide. This also means that only the user who executes a statement can make fetch requests for the statement’s results.
Databricks recommends the following security best practices whenever you use the Databricks SQL Statement Execution API along with the `EXTERNAL_LINKS` disposition to retrieve large data sets:
- Remove the Databricks authorization header for Azure storage requests
- Protect SAS URLs and SAS tokens
The `EXTERNAL_LINKS` disposition can be disabled upon request. To make this request, create a support case. See _.
Remove the Databricks authorization header for Azure storage requests
All calls to the Databricks SQL Statement Execution API must include an `Authorization` header that contains Azure Databricks access credentials. Do not include this `Authorization` header whenever you download data from Azure storage. This header is not required and might unintentionally expose your Azure Databricks access credentials.
Protect SAS URLs and SAS tokens
Whenever you use the `EXTERNAL_LINKS` disposition, a short-lived SAS URL is generated, which the caller can use to download the results directly from Azure storage by using TLS. As a short-lived SAS token is embedded within this SAS URL, you should protect both the SAS URL and the SAS token.