Data scientists working in Python or R typically acquire data by way of REST APIs. Both environments provide libraries that help you make HTTP calls to REST endpoints, then transform JSON responses into dataframes. But that's never as simple as we'd like. When you're reading a lot of data from a REST API, you may need to do it a page at a time, but pagination works differently from one API to the next. So does unpacking the resulting JSON structures. HTTP and JSON are low-level standards, and REST is a loosely-defined framework, but nothing guarantees absolute simplicity, never mind consistency across APIs.
What if there were a way of reading from APIs that abstracted all the low-level grunt work and worked the same way everywhere? Good news! That's exactly what Steampipe does. It's a tool that translates REST API calls directly into SQL tables. Here are three examples of questions that you can ask and answer using Steampipe.
1. Twitter: What are recent tweets that mention PySpark?
Here's a SQL query to ask that question:
select
  id,
  text
from
  twitter_search_recent
where
  query = 'pyspark'
order by
  created_at desc
limit 5;
Here's the answer:
+---------------------+------------------------------------------------------------------------------------------------>
| id                  | text                                                                                           >
+---------------------+------------------------------------------------------------------------------------------------>
| 1526351943249154050 | @dump Tenho trabalhando bastante com Spark, mas especificamente o PySpark. Vale a pena usar um >
| 1526336147856687105 | RT @MitchellvRijkom: PySpark Tip ⚡                                                             >
|                     |                                                                                                >
|                     | When to use what StorageLevel for Cache / Persist?                                             >
|                     |                                                                                                >
|                     | StorageLevel decides how and where data should be s…                                           >
| 1526322757880848385 | Solve challenges and exceed expectations with a career as a AWS Pyspark Engineer. https://t.co/>
| 1526318637485010944 | RT @JosMiguelMoya1: #pyspark #spark #BigData curso completo de Python y Spark con PySpark      >
|                     |                                                                                                >
|                     | https://t.co/qf0gIvNmyx                                                                        >
| 1526318107228524545 | RT @money_personal: PySpark & AWS: Master Big Data With PySpark and AWS                        >
|                     | #ApacheSpark #AWSDatabases #BigData #PySpark #100DaysofCode                                    >
|                     | -> http…                                                                                       >
+---------------------+------------------------------------------------------------------------------------------------>
The table that's being queried here, twitter_search_recent, receives the output from Twitter's /2/tweets/search/recent endpoint and formulates it as a table with those columns. You don't have to make an HTTP call to that API endpoint, or unpack the results; you just write a SQL query that refers to the documented columns. One of those columns, query, is special: it encapsulates Twitter's query syntax. Here, we are just looking for tweets that match PySpark, but we could as easily refine the query by pinning it to specific users, URLs, types (is:retweet, is:reply), properties (has:mentions, has:media), and so on. That query syntax is the same no matter how you're accessing the API: from Python, from R, or from Steampipe. It's a lot to think about, and it's all you should really need to know when crafting queries to mine Twitter data.
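For instance, here's a sketch of one such refinement. The operators (-is:retweet, has:media) belong to Twitter's search syntax, not to Steampipe, and this particular combination is just an illustration:

select
  id,
  text
from
  twitter_search_recent
where
  query = 'pyspark -is:retweet has:media'
order by
  created_at desc
limit 5;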
2. GitHub: What are repositories that mention PySpark?
Here's a SQL query to ask that question:
select
  name,
  owner_login,
  stargazers_count
from
  github_search_repository
where
  query = 'pyspark'
order by stargazers_count desc
limit 10;
Here's the answer:
+----------------------+-------------------+------------------+
| name                 | owner_login       | stargazers_count |
+----------------------+-------------------+------------------+
| SynapseML            | microsoft         |             3297 |
| spark-nlp            | JohnSnowLabs      |             2725 |
| incubator-linkis     | apache            |             2524 |
| ibis                 | ibis-project      |             1805 |
| spark-py-notebooks   | jadianes          |             1455 |
| petastorm            | uber              |             1423 |
| awesome-spark        | awesome-spark     |             1314 |
| sparkit-learn        | lensacom          |             1124 |
| sparkmagic           | jupyter-incubator |             1121 |
| data-algorithms-book | mahmoudparsian    |             1001 |
+----------------------+-------------------+------------------+
This looks very much like the first example! In this case, the table that's being queried, github_search_repository, receives the output from GitHub's /search/repositories endpoint and formulates it as a table with those columns.
In both cases the Steampipe documentation not only shows you the schemas that govern the mapped tables, it also gives examples (Twitter, GitHub) of SQL queries that use the tables in various ways.
Note that these are just two of many available tables. The Twitter API is mapped to 7 tables, and the GitHub API is mapped to 41 tables. Each works the same way; for instance, the github_user table that appears in the next example can be queried on its own, as sketched below.
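A minimal sketch, using only the github_user columns this post relies on elsewhere (login, twitter_username); the login shown is just a placeholder:

select
  login,
  twitter_username
from
  github_user
where
  login = 'torvalds';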
3. Twitter + GitHub: What have owners of PySpark-related repositories tweeted lately?
To answer this question we need to consult two different APIs, then join their results. That's even harder to do, in a consistent way, when you're reasoning over REST payloads in Python or R. But this is the kind of thing SQL was born to do. Here's one way to ask the question in SQL.
-- find pyspark repos
with github_repos as (
  select
    name,
    owner_login,
    stargazers_count
  from
    github_search_repository
  where
    query = 'pyspark' and name ~ 'pyspark'
  order by stargazers_count desc
  limit 50
),
-- find twitter handles of repo owners
github_users as (
  select
    u.login,
    u.twitter_username
  from
    github_user u
  join
    github_repos r
  on
    r.owner_login = u.login
  where
    u.twitter_username is not null
),
-- find corresponding twitter users
twitter_userids as (
  select
    id
  from
    twitter_user t
  join
    github_users g
  on
    t.username = g.twitter_username
)
-- find tweets from those users
select
  t.author->>'username' as twitter_user,
  'https://twitter.com/' || (t.author->>'username') || '/status/' || t.id as url,
  t.text
from
  twitter_user_tweet t
join
  twitter_userids u
on
  t.user_id = u.id
where
  t.created_at > now()::date - interval '1 week'
order by
  t.author
limit 5
Here is the answer:
+----------------+---------------------------------------------------------------+------------------------------------->
| twitter_user   | url                                                           | text                                >
+----------------+---------------------------------------------------------------+------------------------------------->
| idealoTech     | https://twitter.com/idealoTech/status/1524688985649516544    | Can you find creative soluti        >
|                |                                                               |                                     >
|                |                                                               | Join our @codility Order #API Challe>
|                |                                                               |                                     >
|                |                                                               | #idealolife #codility #php          >
| idealoTech     | https://twitter.com/idealoTech/status/1526127469706854403    | Our #ProductDiscovery team at idealo>
|                |                                                               |                                     >
|                |                                                               | Think you can solve it? 😎          >
|                |                                                               | ➡️ https://t.co/ELfUfp94vB https://t>
| ioannides_alex | https://twitter.com/ioannides_alex/status/1525049398811574272 | RT @scikit_learn: scikit-learn 1.1 i>
|                |                                                               | What's new? You can check the releas>
|                |                                                               |                                     >
|                |                                                               | pip install -U…                     >
| andfanilo      | https://twitter.com/andfanilo/status/1524999923665711104     | @edelynn_belle Thanks! Sometimes it >
| andfanilo      | https://twitter.com/andfanilo/status/1523676489081712640     | @juliafmorgado Good luck on the reco>
|                |                                                               |                                     >
|                |                                                               | My advice: power through it + a dead>
|                |                                                               |                                     >
|                |                                                               | I hated my first few short videos bu>
|                |                                                               |                                     >
|                |                                                               | Looking forward to the video 🙂     >
+----------------+---------------------------------------------------------------+------------------------------------->
When APIs frictionlessly become tables, you can devote your full attention to reasoning over the abstractions represented by those APIs. Larry Wall, the creator of Perl, famously said: "Easy things should be easy, hard things should be possible." The first two examples are things that should be, and are, easy: each is just 10 lines of simple, straight-ahead SQL that requires no wizardry at all.
The third example is a harder thing. It would be hard in any programming language. But SQL makes it possible in several nice ways. The solution is made of concise stanzas (CTEs, Common Table Expressions) that form a pipeline. Each phase of the pipeline handles one clearly-defined piece of the problem. You can validate the output of each phase before proceeding to the next. And you can do all this with the most mature and widely-used grammar for selection, filtering, and recombination of data.
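One way to do that validation is to end the pipeline early and select from a single stanza. This sketch reuses the first CTE from the query above verbatim:

-- validate the first phase of the pipeline in isolation
with github_repos as (
  select
    name,
    owner_login,
    stargazers_count
  from
    github_search_repository
  where
    query = 'pyspark' and name ~ 'pyspark'
  order by stargazers_count desc
  limit 50
)
select * from github_repos;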
Do I have to use SQL?
No! If you like the idea of mapping APIs to tables, but you'd rather reason over those tables in Python or R dataframes, then Steampipe can oblige. Under the covers it's Postgres, enhanced with foreign data wrappers that handle the API-to-table transformation. Anything that can connect to Postgres can connect to Steampipe, including SQL drivers like Python's psycopg2 and R's RPostgres as well as business-intelligence tools like Metabase, Tableau, and PowerBI. So you can use Steampipe to frictionlessly consume APIs into dataframes, then reason over the data in Python or R.
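Here's a sketch of that connection path; the port, database, and user shown are the defaults when you run steampipe service start, and may differ in your setup:

-- Steampipe exposes an ordinary Postgres endpoint, so any Postgres client
-- can connect with a standard connection string, e.g.:
--   postgresql://steampipe@localhost:9193/steampipe
-- Once connected, a client such as psycopg2 or RPostgres just sends SQL:
select
  name,
  owner_login,
  stargazers_count
from
  github_search_repository
where
  query = 'pyspark';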
But if you haven't used SQL in this way before, it's worth a look. Consider this comparison of SQL to Pandas from How to rewrite your SQL queries in Pandas.
| SQL | Pandas |
|---|---|
| select * from airports | airports |
| select * from airports limit 3 | airports.head(3) |
| select id from airports where ident = 'KLAX' | airports[airports.ident == 'KLAX'].id |
| select distinct type from airports | airports.type.unique() |
| select * from airports where iso_region = 'US-CA' and type = 'seaplane_base' | airports[(airports.iso_region == 'US-CA') & (airports.type == 'seaplane_base')] |
| select ident, name, municipality from airports where iso_region = 'US-CA' and type = 'large_airport' | airports[(airports.iso_region == 'US-CA') & (airports.type == 'large_airport')][['ident', 'name', 'municipality']] |
We can argue the merits of one style versus the other, but there's no question that SQL is the most universal and widely-implemented way to express these operations on data. So no, you don't have to use SQL to its fullest potential in order to benefit from Steampipe. But you might find that you want to.