
Ingest 15 Billion Logs Per Day and Keep Big Queries Within 1 Second

This data warehousing use case is about scale. The user is China Unicom, one of the world's largest telecommunication service providers. Using Apache Doris, they deploy multiple petabyte-scale clusters on dozens of machines to support the 15 billion log additions per day from their over 30 business lines. Such a gigantic log analysis system is part of their cybersecurity management. For the needs of real-time monitoring, threat tracing, and alerting, they require a log analytics system that can automatically collect, store, analyze, and visualize logs and event records.

From an architectural perspective, the system should be able to undertake real-time analysis of logs in various formats and, of course, be scalable to support the huge and ever-growing data size. The rest of this post is about what their log processing architecture looks like and how they realize stable data ingestion, cost-effective storage, and quick queries with it.

This is an overview of their data pipeline. The logs are collected into the data warehouse and go through several layers of processing.

  • ODS: Original logs and alerts from all sources are gathered into Apache Kafka. Meanwhile, a copy of them is saved in HDFS for data verification or replay.

  • DWD: This is where the fact tables are. Apache Flink cleans, standardizes, backfills, and de-identifies the data, and writes it back to Kafka. These fact tables are also put into Apache Doris, so that Doris can trace a certain item or use them for dashboarding and reporting. As logs are not averse to duplication, the fact tables are arranged in the Duplicate Key model of Apache Doris (see the table-creation sketch after this list).

  • DWS: This layer aggregates data from DWD and lays the foundation for queries and analysis.

  • ADS: In this layer, Apache Doris auto-aggregates data with its Aggregate Key model and auto-updates data with its Unique Key model.
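To make the DWD layer concrete, below is a minimal sketch of a Duplicate Key fact table created over JDBC (Doris FEs speak the MySQL protocol on their query port). The table name, columns, host, and credentials are hypothetical placeholders, not the user's actual schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateDwdLogTable {
    public static void main(String[] args) throws Exception {
        // Doris FEs expose a MySQL-compatible protocol, so the stock MySQL JDBC driver works.
        String url = "jdbc:mysql://doris-fe-host:9030/demo_db"; // placeholder host and database
        try (Connection conn = DriverManager.getConnection(url, "root", "");
             Statement stmt = conn.createStatement()) {
            // Duplicate Key model: rows are stored exactly as ingested, duplicates included,
            // which suits raw log fact tables in the DWD layer.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS dwd_security_log (" +
                "  log_time DATETIME NOT NULL," +
                "  source_ip VARCHAR(64)," +
                "  event_type VARCHAR(32)," +
                "  message STRING" +
                ") DUPLICATE KEY(log_time, source_ip)" +
                " DISTRIBUTED BY HASH(source_ip) BUCKETS 16" +
                " PROPERTIES (\"replication_num\" = \"3\")");
        }
    }
}
```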

Architecture 2.0 evolves from Architecture 1.0, which was supported by ClickHouse and Apache Hive. The transition arose from the user's needs for real-time data processing and multi-table join queries. In their experience with the old architecture, they found inadequate support for concurrency and multi-table joins, manifested by frequent timeouts in dashboarding and OOM errors in distributed joins.

Now let's take a look at their practice in data ingestion, storage, and queries with Architecture 2.0.

Stable ingestion of 15 billion logs per day

In the user's case, their business churns out 15 billion logs every day. Ingesting such a data volume quickly and stably is a real challenge. With Apache Doris, the recommended way is to use the Flink-Doris-Connector. It is developed by the Apache Doris community for large-scale data writing. The component requires simple configuration. It implements Stream Load and can reach a writing speed of 200,000~300,000 logs per second without interrupting the data analytic workloads.
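For orientation, here is a minimal sketch of such a Flink ingestion job in Java. It assumes a recent Flink-Doris-Connector release (package paths, e.g. for SimpleStringSerializer, have moved between versions); the FE address, credentials, table identifier, and sample record are placeholders.

```java
import java.util.Properties;

import org.apache.doris.flink.cfg.DorisExecutionOptions;
import org.apache.doris.flink.cfg.DorisOptions;
import org.apache.doris.flink.cfg.DorisReadOptions;
import org.apache.doris.flink.sink.DorisSink;
import org.apache.doris.flink.sink.writer.serializer.SimpleStringSerializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LogIngestJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpoints bound Stream Load transactions; the 60s interval mirrors
        // the tuning described in the next section.
        env.enableCheckpointing(60_000);

        // In production this would read from Kafka; a literal JSON line stands in here.
        DataStream<String> logs =
            env.fromElements("{\"log_time\":\"2023-01-01 00:00:00\",\"event_type\":\"login\"}");

        Properties loadProps = new Properties();
        loadProps.setProperty("format", "json"); // tell Stream Load to parse JSON lines
        loadProps.setProperty("read_json_by_line", "true");

        DorisSink<String> sink = DorisSink.<String>builder()
            .setDorisReadOptions(DorisReadOptions.builder().build())
            .setDorisOptions(DorisOptions.builder()
                .setFenodes("doris-fe-host:8030")            // placeholder FE HTTP address
                .setTableIdentifier("demo_db.dwd_security_log")
                .setUsername("root")
                .setPassword("")
                .build())
            .setDorisExecutionOptions(DorisExecutionOptions.builder()
                .setLabelPrefix("log-ingest")                // label prefix for load transactions
                .setStreamLoadProp(loadProps)
                .build())
            .setSerializer(new SimpleStringSerializer())
            .build();

        logs.sinkTo(sink);
        env.execute("doris-log-ingest");
    }
}
```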

A lesson learned is that when using Flink for high-frequency writing, you need to find the right parameter configuration for your case to avoid data version accumulation. In this case, the user made the following optimizations:

  • Flink checkpointing: They increase the checkpoint interval from 15s to 60s to reduce the writing frequency and the number of transactions processed by Doris per unit of time. This relieves data writing pressure and avoids generating too many data versions.

  • Data pre-aggregation: For data with the same ID but coming from various tables, Flink pre-aggregates it based on the primary key ID and creates a flat table, in order to avoid excessive resource consumption caused by multi-source data writing.

  • Doris compaction: The trick here includes finding the right Doris backend (BE) parameters to allocate the right amount of CPU resources for data compaction, setting an appropriate number of data partitions, buckets, and replicas (too many data tablets brings huge overheads), and dialing up max_tablet_version_num to avoid version accumulation (a monitoring sketch follows this list).
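One way to keep an eye on version accumulation from the SQL side is sketched below: SHOW TABLETS reports a VersionCount per tablet replica, and if that climbs toward the BE's max_tablet_version_num (a be.conf setting, 500 by default), ingestion is outpacing compaction. The table name reuses the earlier hypothetical sketch, and the alert threshold is illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TabletVersionMonitor {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://doris-fe-host:9030/demo_db"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "root", "");
             Statement stmt = conn.createStatement();
             // One row per tablet replica; VersionCount is the number of
             // data versions not yet merged by compaction.
             ResultSet rs = stmt.executeQuery("SHOW TABLETS FROM dwd_security_log")) {
            while (rs.next()) {
                long versions = rs.getLong("VersionCount");
                // Illustrative threshold below the default max_tablet_version_num of 500.
                if (versions > 400) {
                    System.out.println("Tablet " + rs.getString("TabletId")
                            + " has " + versions + " pending versions");
                }
            }
        }
    }
}
```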

These measures together ensure daily ingestion stability. The user has witnessed stable performance and low compaction scores in the Doris backends. In addition, the combination of data pre-processing in Flink and the Unique Key model in Doris ensures quicker data updates.
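For reference, a minimal, hypothetical example of a Unique Key table: rows sharing the same key replace earlier versions on write, so corrected or re-ingested records overwrite stale ones. All names are illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateUniqueKeyTable {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://doris-fe-host:9030/demo_db"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "root", "");
             Statement stmt = conn.createStatement()) {
            // Unique Key model: only the latest row per device_id is kept,
            // so Flink can simply re-emit a full record to update it.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS ads_device_state (" +
                "  device_id BIGINT NOT NULL," +
                "  last_seen DATETIME," +
                "  threat_level INT" +
                ") UNIQUE KEY(device_id)" +
                " DISTRIBUTED BY HASH(device_id) BUCKETS 8" +
                " PROPERTIES (\"replication_num\" = \"3\")");
        }
    }
}
```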

Storage strategies to reduce costs by 50%

The size and generation rate of logs also impose pressure on storage. Among the immense log data, only a part is of high informational value, so storage should be differentiated. The user has three storage strategies to reduce costs, illustrated by a sketch after this list.

  • ZSTD (ZStandard) compression algorithm: For tables larger than 1TB, specify the compression method as "ZSTD" upon table creation; it realizes a compression ratio of 10:1.

  • Tiered storage of hot and cold data: This is supported by a recently added feature of Doris. The user sets a data "cooldown" period of 7 days. That means data from the past 7 days (that is, hot data) is stored on SSD. As time goes by and hot data "cools down" (gets older than 7 days), it is automatically moved to HDD, which is less expensive. As data gets even "colder", it is moved to object storage for much lower storage costs. Plus, in object storage, data is saved with only one copy instead of three. This further cuts down costs and the overheads brought by redundant storage.

  • Differentiated replica numbers for different data partitions: The user has partitioned their data by time range. The principle is to have more replicas for newer data partitions and fewer for the older ones. In their case, data from the past 3 months is frequently accessed, so they keep 3 replicas for those partitions; data that is 3~6 months old has 2 replicas, and data from more than 6 months ago has one single copy.
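A sketch of how the first two strategies can be expressed at table creation, under stated assumptions: the table and its settings are hypothetical, dynamic_partition.hot_partition_num keeps the newest partitions on SSD before they cool down to HDD, and the object-storage tier (configured separately through a Doris storage policy) is omitted for brevity.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateTieredLogTable {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://doris-fe-host:9030/demo_db"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "root", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS dwd_raw_log (" +
                "  log_time DATETIME NOT NULL," +
                "  source_ip VARCHAR(64)," +
                "  message STRING" +
                ") DUPLICATE KEY(log_time)" +
                " PARTITION BY RANGE(log_time) ()" +          // partitions created dynamically
                " DISTRIBUTED BY HASH(source_ip) BUCKETS 16" +
                " PROPERTIES (" +
                "  \"compression\" = \"zstd\"," +             // ZSTD, ~10:1 on log data
                "  \"replication_num\" = \"3\"," +
                "  \"dynamic_partition.enable\" = \"true\"," +
                "  \"dynamic_partition.time_unit\" = \"DAY\"," +
                "  \"dynamic_partition.start\" = \"-180\"," + // keep ~6 months of partitions
                "  \"dynamic_partition.end\" = \"3\"," +
                "  \"dynamic_partition.prefix\" = \"p\"," +
                "  \"dynamic_partition.buckets\" = \"16\"," +
                "  \"dynamic_partition.hot_partition_num\" = \"7\"" + // last 7 days stay on SSD
                ")");
            // Replica counts can then be dialed down per partition as data ages, e.g.:
            // ALTER TABLE dwd_raw_log MODIFY PARTITION (p20230101) SET ("replication_num" = "1");
        }
    }
}
```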

With these three strategies, the user has reduced their storage costs by 50%.

Differentiated query strategies based on data size

Some logs must be immediately traced and located, such as those of abnormal events or failures. To ensure real-time response to these queries, the user applies different query strategies for different data sizes (a sketch of the latter two follows the list):

  • Less than 100G: The user makes use of the dynamic partitioning feature of Doris. Small tables are partitioned by date and large tables by hour. This avoids data skew. To further ensure balance of data within a partition, they use the snowflake ID as the bucketing field. They also set a starting offset: data of the most recent 20 days is kept. This is the balance point between data backlog and analytic needs.

  • 100G~1T: These tables have materialized views, which are pre-computed result sets stored in Doris. Thus, queries on these tables are much faster and less resource-consuming. The DDL syntax of materialized views in Doris is the same as in PostgreSQL and Oracle.

  • More than 100T: These tables are put into the Aggregate Key model of Apache Doris and pre-aggregated. In this way, queries over 2 billion log records can be done in 1~2s.
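Below is a sketch of the latter two strategies under assumed schemas: an Aggregate Key table that merges rows at write time, and a materialized view over a hypothetical detail table. All table, view, and column names are illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateQueryAcceleration {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://doris-fe-host:9030/demo_db"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "root", "");
             Statement stmt = conn.createStatement()) {
            // Aggregate Key model: rows with equal key columns are merged on write,
            // with value columns combined by their declared aggregate functions.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS ads_event_stats (" +
                "  event_date DATE NOT NULL," +
                "  event_type VARCHAR(32) NOT NULL," +
                "  event_count BIGINT SUM DEFAULT \"0\"," +
                "  max_latency_ms INT MAX DEFAULT \"0\"" +
                ") AGGREGATE KEY(event_date, event_type)" +
                " DISTRIBUTED BY HASH(event_type) BUCKETS 16" +
                " PROPERTIES (\"replication_num\" = \"3\")");

            // Materialized view on a detail table: Doris keeps it in sync and
            // transparently rewrites matching queries to hit the pre-computed set.
            stmt.execute(
                "CREATE MATERIALIZED VIEW mv_event_daily AS " +
                "SELECT event_date, event_type, COUNT(event_type) " +
                "FROM dwd_security_log_detail " +
                "GROUP BY event_date, event_type");
        }
    }
}
```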

These strategies have shortened the response time of queries. For example, a query of a specific data item used to take minutes; now it can be finished in milliseconds. In addition, for big tables that contain 10 billion data records, queries on different dimensions can all be done in a few seconds.

The user is now testing the newly added inverted index in Apache Doris. It is designed to speed up full-text search of strings as well as equivalence and range queries of numerics and datetimes. They have also provided valuable feedback about the auto-bucketing logic in Doris: currently, Doris decides the number of buckets for a partition based on the data size of the previous partition. The problem for the user is that most of their new data comes in during the daytime, but little at night. In their case, Doris thus creates too many buckets for night data but too few for daytime data, which is the opposite of what they need. They hope to add a new auto-bucketing logic, where the reference for Doris to decide the number of buckets is the data size and distribution of the previous day. They have come to the Apache Doris community, and we are now working on this optimization.

Zaki Lu is a former product manager at Baidu and now DevRel for the Apache Doris open source community.