Skip to main content

This Week in RisingWave #9

· 7 min read
xxchan
Tao Wu
Dylan
Bugen Zhao
StrikeW

This week, we have a lot of updates 😆! To name a few: functional index, proc_time(), int256, CREATE CONNECTION, and ... the beta release of RisingWave Cloud!

RisingWave Cloud Beta release 🚀

Announcing RisingWave Cloud Beta Release

Just one year after we open-sourced RisingWave Database, we're pleased to announce another amazing achievement: the beta release of RisingWave Cloud, a hosted service that provides out-of-the-box RisingWave services. 🫡

To celebrate it, we have a featured event next week:

Rising RisingWave: A Look Back at the Journey to Build Distributed SQL Streaming Database

Don't miss the chance to learn how we got started and where we are now!

Features Updates 🌟

Functional Index

feat(frontend): support functional indexes creation by chenzl25 #8976

A functional index allows you to index the results of a function or expression, instead of just the raw data in a table. One interesting use case for functional indexes is indexing a field of a json type column.

Here is an example of how we use functional indexes.

-- Create a table with a jsonb column and insert some data.
create table t (j jsonb);
INSERT INTO t VALUES ('{"k": "abc"}');

-- Create a functional index.
create index idx on t(j->>'k');

-- Run a query on the functional index. The optimizer will
-- automatically choose the correct index to retrieve data
-- for your query.
select * from t where j->>'k' = 'abc';

Take a look at the query plan by using a explain statement. We can see that the query will access the index only. The predicate j->>'k' = 'abc' in the query is recognized by the optimizer when it does an optimization called index selection where the optimizer will try to replace an original plan accessing table t with a more efficient plan accessing the index idx.

explain select * from t where j->>'k' = 'abc';
QUERY PLAN
-----------------------------------------------------------------------------------------
BatchExchange { order: [], dist: Single }
└─BatchScan { table: idx, columns: [j], scan_ranges: [JSONB_ACCESS_STR = Utf8("abc")] }

proc_time()

feat: introduce proc_time() by yuhao-su #9088

Time is a critical concept in stream processing. However, it can sometimes be difficult to understand. You may have heard terms like "event time" and "processing time". As we work on advanced features of stream processing like watermark, We are adding processing time support (which is the time a record is processed in RisingWave) to enable features such as temporal joins (see RFC here). But don't worry, you don’t have to understand the subtle differences of time in most cases. As a streaming database, RisingWave can make things easier for you!

Int256

We are adding an experimental feature: an int256 datatype.

PostgreSQL supports 16, 32, and 64-bit integers. Some users have requested the addition of 256-bit integers, as they are a fundamental type in the blockchain world.

Fun fact 1: PostgreSQL calls 16/32/64-bit integers int2/4/8 (the number of bytes), so if we use the same convention, 256-bit integer will be called as int32, which is obviously confusing 🤣.

Fun fact 2: An alternative for int256 is an arbitrary-precision decimal type, but it’s slow and almost no modern databases support it.

feat(source): Use a private link connection to create Kafka source by StrikeW #9119

Users in a cloud environment may have already set up an AWS MSK (cloud-hosted Kafka) service. However, they may encounter problems to connect to the MSK brokers when creating a source in RisingWave. This is because the user's MSK service is usually located in a different VPC than the RisingWave instance in the cloud, and the MSK will broadcast to its clients the internal IP addresses of brokers, which is unreachable from other VPCs.

To solve this issue, we leverage the AWS PrivateLink networking to establish a connection from RisingWave’s VPC to the user's VPC. Firstly, users need to set up an endpoint service to expose the MSK service. Then RisingWave will create an endpoint to access the exposed service.

We have introduced a new SQL command CREATE CONNECTION for this purpose. It creates a Connection database object, which can represent a PrivateLink connection in the cloud. With this connection, users can create a Kafka source to consume messages in an AWS MSK service.

Performance Optimizations 💪

Switching the hash algorithm for the in-memory cache

fix(streaming): use xxhash64 for hash key in cache by BugenZhao #9163

As a distributed streaming database, RisingWave dispatches data with consistent hashing to achieve parallel execution. In short, to decide which partition a record will be dispatched to, we first calculate the hash value by some columns (the “distribution key”) with CRC32, then take the least significant bits as the index to look up the scheduling mapping.

Here comes the problem: we also accidentally use the same hash algorithm for the in-memory cache of executors 😰. Since the records dispatched into the same partition already have some locality of their hash values, using the same algorithm for the intra-partition cache can lead to abnormal hash collisions, then a heavy K::Eq span in the flame graph. By replacing the cache’s algorithm and then making them orthogonal, we achieve a performance increase of ~20%.

Rusty stuff 🦀️

We ❤️ Rust! This section is about some general Rust related issues.

Handle unrecognized fields in serde

refactor(config): check unrecognized fields during deserialization by BugenZhao #9156

Do you know this common tip for tolerating and recording unrecognized fields when deserializing with serde in Rust? Here it is:

struct Foo {
...

#[serde(default, flatten)]
pub unrecognized: HashMap<String, serde_json::Value>,
}

Since almost all structures can be parsed into a JSON object, all unrecognized fields can be automagically put into this flattened map 😄. Furthermore, if we want to customize the behavior for encountering an unrecognized field, we can also wrap the HashMap into a struct Unrecognized and put the logic in impl Deserialize. For example, we print a warning for each unrecognized field in this PR.

cargo audit

fix: fix cargo audit issues and add audit to ci by xxchan #8959

As RisingWave gets mature, we care a lot about things like stability and security. One RisingWave Slack community member reported that there’s a cargo audit security alert ⚠️:

Crate:     time
Version: 0.1.45
Title: Potential segfault in the time crate
Date: 2020-11-18
ID: RUSTSEC-2020-0071
URL: https://rustsec.org/advisories/RUSTSEC-2020-0071
Severity: 6.2 (medium)
Solution: Upgrade to >=0.2.23
Dependency tree:
time 0.1.45
└── chrono 0.4.24

This alert is actually quite commonly seen in many Rust projects. After some investigation, I confirmed that it's a false positive: chrono doesn't use the vulnerable functions, so it's not affected by RUSTSEC-2020-0071. Interestingly, chrono did have a similar issue (RUSTSEC-2020-0159), but it was already patched in version 0.4.20 by rewriting the vulnerable C function in Rust (see a detailed writeup here).

P.S. The false positive problem of advisory-db is an unsolved problem. See https://github.com/rustsec/advisory-db/issues/288#issuecomment-1229186835

New contributors

@GengTeng submitted his 2nd PR:

After his 3rd PR last week, @broccoliSpicy continues to actively participate in high-quality discussions 🫶. Starting from a small issue and then diving deeper into related ones is quite a good way to get involved in a large project!


Finally, welcome to join the RisingWave Slack community. Also check out the good first issue and help wanted issues if you want to join the development of an open source database system!

So much for this week. See you next week! 🤗