Alarm Management

Tips

Alarm management is available only in the Enterprise Edition. Please contact CC to get the alarm plugin.

Introduction

CnosDB supports alarm management, which lets you view alarm information and configure alarm rules, alarm notification methods, and alarm notification groups.

Principle of Implementation

For time series data stored in CnosDB, the cnos-alert component periodically executes an SQL query according to the configuration file submitted by the user, compares the query result with the threshold, and sends the results that trigger an alarm to the receiving terminal specified by the user.

  • SQL query: a standard cnosdb-sql query statement. Given how alarms are used, it generally has a time-related WHERE clause.
  • Threshold: in the configuration, you specify a field of the SQL query result and set a threshold for that field to trigger the alarm. Five types of thresholds are currently supported: greater than, less than, equal to, within a range, and outside a range.
  • Notification receiving terminal: Slack and Twitter are currently supported.
  • History: all query results and issued notifications that trigger alarms are recorded in CnosDB. The alarm rules configured by the user are recorded in a JSON file at the user-specified location.

Figure: Principle of the alarm component
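The cycle described above (query, compare, notify) can be sketched in a few lines of Python. This is only an illustration of the idea, not the cnos-alert implementation: `run_cycle`, the `run_query` and `notify` callbacks, and the sample rows are all hypothetical names introduced here.

```python
from typing import Callable, Iterable

Row = dict  # one query result row, e.g. {"cpu": "cpu0", "AVG(cpu.usage_user)": 0.35}


def run_cycle(
    run_query: Callable[[], Iterable[Row]],
    field: str,
    triggers: Callable[[float], bool],
    notify: Callable[[list], None],
) -> list:
    """Run one query, keep the rows whose `field` triggers the alarm, notify once."""
    triggered = [row for row in run_query() if triggers(float(row[field]))]
    if triggered:
        notify(triggered)
    return triggered


# Hypothetical stand-ins for the SQL query and the receiving terminal.
rows = [
    {"cpu": "cpu0", "AVG(cpu.usage_user)": 0.35},
    {"cpu": "cpu1", "AVG(cpu.usage_user)": 0.10},
]
sent = []
result = run_cycle(lambda: rows, "AVG(cpu.usage_user)", lambda v: v > 0.2, sent.append)
print(result)
```

In a real deployment the scheduler, the query, and the notification are all handled by the alert server; the sketch only shows how the three steps relate.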

Start

./alertserver --config=alertserver.yaml --serverport=9001

Configuration(alertserver.yaml)

query: # CnosDB instance that holds the queried data
    nodeHost: 127.0.0.1
    nodePort: 8902
    authorization: ********* # only a Base64-encoded username and password is supported
alert: # alarm configuration
    filePath: /etc/alert.json
store: # CnosDB instance that stores the alarm records
    nodeHost: 127.0.0.1
    nodePort: 8902
    authorization: ********* # only a Base64-encoded username and password is supported
    alerttable: alertrecord # alarm record table name
    notitable: notirecord # notification record table name
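The authorization value is a Base64 encoding of the username and password. The exact layout is not spelled out here; the sketch below assumes the common `username:password` form used by HTTP Basic authentication, and `encode_credentials` is a hypothetical helper name. Check the value against your deployment before relying on it.

```python
import base64


def encode_credentials(username: str, password: str) -> str:
    """Base64-encode "username:password" (assumed layout, as in HTTP Basic auth)."""
    raw = f"{username}:{password}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


# Example with a hypothetical user "root" and an empty password.
token = encode_credentials("root", "")
print(token)
```

Note that Base64 is an encoding, not encryption, so the configuration file should be protected like any other file that contains credentials.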

API Description

/api/http/ping

Description

Test the running status of the service

Request Method

  • GET

Request Example

curl -X GET http://127.0.0.1:30001/api/http/ping

Request Succeeded

{"message":"ok"}

Request Failed

curl: error

/api/v1/alert/config/rule

Description

Create an alert rule.

Request Method

  • POST

Request Example

curl -X POST http://127.0.0.1:30001/api/v1/alert/config/rule

Request Parameters

{
    "tenant": "cnosdb", # tenant that holds the queried data
    "data": {
        "enabled": "on", # initial execution status of the alert rule, ["on", "off"]
        "dbname": "public", # database that holds the queried data
        "sqlType": 1, # SQL type; 1 is recommended for local use and runs the complete SQL given in sqlCmd
        "sqlCmd": "select cpu, avg(usage_user) from cpu where time >= now() - interval '20' SECOND group by cpu", # used together with sqlType 1
        "period": "15s", # query execution period: a number followed by one of ['s', 'm', 'h', 'd']
        "thresholds": [ # thresholds
            {
                "checks": [ # checks
                    {
                        "value": "0.2", # comparison value
                        "operator": 1 # comparison operator, -2: less than or equal to, -1: less than, 0: equal to, 1: greater than, 2: greater than or equal to
                    }
                ],
                "period": "Automation", # notification task execution period, ['Automation', 'Hourly', 'Daily', 'Weekly'], where Automation is once a minute
                "severity": "Medium", # alarm level
                "endtools": [ # receiving terminals
                    {
                        "name": "slack", # terminal name
                        "receiver": "https://hooks.slack.com/services/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", # Slack requires a webhook URL; Twitter requires a verification key
                        "format": "", # format of the notification content; see the example at the end of this page
                        "tool": "slack" # terminal type
                    }
                ],
                "checkrelation": 0 # a threshold may contain multiple checks; checkrelation sets how they are combined, 0: OR, 1: AND
            }
        ],
        "name": "cpu new", # rule name
        "description": "cpu local demo", # description
        "field": "AVG(cpu.usage_user)" # field of the query result that is compared with the threshold
    }
}
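To make the meaning of operator and checkrelation concrete, here is a minimal Python sketch of how one threshold entry could be evaluated against a single value. It follows the operator codes and the OR/AND convention listed in the parameters above, but `threshold_triggered` and `OPERATORS` are hypothetical names, and the server's actual evaluation logic may differ.

```python
import operator

# Operator codes as listed in the request parameters above.
OPERATORS = {
    -2: operator.le,  # less than or equal to
    -1: operator.lt,  # less than
    0: operator.eq,   # equal to
    1: operator.gt,   # greater than
    2: operator.ge,   # greater than or equal to
}


def threshold_triggered(value: float, threshold: dict) -> bool:
    """Evaluate every check against `value` and combine them per checkrelation."""
    results = [
        OPERATORS[check["operator"]](value, float(check["value"]))
        for check in threshold["checks"]
    ]
    # checkrelation: 0 combines the checks with OR, 1 combines them with AND.
    return all(results) if threshold["checkrelation"] == 1 else any(results)


greater_than = {"checks": [{"value": "0.2", "operator": 1}], "checkrelation": 0}
print(threshold_triggered(0.25, greater_than))

# Assumption: a "within the range" threshold expressed as two checks joined by AND.
within_range = {
    "checks": [{"value": "0.2", "operator": 2}, {"value": "0.8", "operator": -2}],
    "checkrelation": 1,
}
print(threshold_triggered(0.9, within_range))
```

Expressing a range as two checks joined by AND is only one plausible reading of the "within the range" threshold type.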

Request Succeeded

{ 
    "message":"succeed", 
    "id":"1"
}

Request Failed

{
    "code":3, 
    "message":"invalid character '}' looking for beginning of object key string", 
    "details":[]
}

/api/v1/alert/config/rule

Description

Modify an alert rule.

Request Method

  • PUT

Request Example

curl -X PUT http://127.0.0.1:30001/api/v1/alert/config/rule

Request Parameters

{
    "id": 1, # rule id
    "tenant": "cnosdb", # tenant that holds the queried data
    "data": {
        "enabled": "on", # initial execution status of the alert rule, ["on", "off"]
        "dbname": "public", # database that holds the queried data
        "sqlType": 1, # SQL type; 1 is recommended for local use and runs the complete SQL given in sqlCmd
        "sqlCmd": "select cpu, avg(usage_user) from cpu where time >= now() - interval '20' SECOND group by cpu", # used together with sqlType 1
        "period": "15s", # query execution period: a number followed by one of ['s', 'm', 'h', 'd']
        "thresholds": [ # thresholds
            {
                "checks": [ # checks
                    {
                        "value": "0.2", # comparison value
                        "operator": 1 # comparison operator, -2: less than or equal to, -1: less than, 0: equal to, 1: greater than, 2: greater than or equal to
                    }
                ],
                "period": "Automation", # notification task execution period, ['Automation', 'Hourly', 'Daily', 'Weekly'], where Automation is once a minute
                "severity": "Medium", # alarm level
                "endtools": [ # receiving terminals
                    {
                        "name": "slack", # terminal name
                        "receiver": "https://hooks.slack.com/services/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", # Slack requires a webhook URL; Twitter requires a verification key
                        "format": "", # format of the notification content; see the example at the end of this page
                        "tool": "slack" # terminal type
                    }
                ],
                "checkrelation": 0 # a threshold may contain multiple checks; checkrelation sets how they are combined, 0: OR, 1: AND
            }
        ],
        "name": "cpu new", # rule name
        "description": "cpu local demo", # description
        "field": "AVG(cpu.usage_user)" # field of the query result that is compared with the threshold
    }
}
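Before submitting or modifying a rule, a client may want to validate the query period string (a number followed by s, m, h, or d). The following Python sketch converts such a string to seconds. `period_to_seconds` is a hypothetical client-side helper, and the assumption that only a single number and unit are allowed comes from the parameter comment above, not from the server code.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def period_to_seconds(period: str) -> int:
    """Convert a period such as "15s" or "2h" to a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", period)
    if match is None:
        raise ValueError(f"invalid period: {period!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


print(period_to_seconds("15s"))
```

Rejecting malformed values on the client gives a clearer error than waiting for the server response.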

Request Succeeded

{ "message":"succeed" }

Request Failed

{
    "code": error id, 
    "message": error string, 
    "details":[]
}

/api/v1/alert/config/rule/tenant/:tenant/id/:id

Description

Get the specified rule information.

Request Method

  • GET

Request Example

curl -X GET http://127.0.0.1:30001/api/v1/alert/config/rule/tenant/cnosdb/id/1

Request Parameters

    :tenant: tenant name 
    :id: rule id

Request Succeeded

{
    "id": "1",
    "data": {
        "enabled": "off",
        "dbname": "public",
        "sql": null,
        "period": "15s",
        "thresholds": [
            {
                "checks": [
                    {
                        "value": "0.2",
                        "operator": 1
                    }
                ],
                "period": "Automation",
                "severity": "Medium",
                "endtools": [
                    {
                        "name": "slack",
                        "receiver": "https://hooks.slack.com/services/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",
                        "format": "",
                        "tool": "slack"
                    }
                ],
                "checkrelation": 0
            }
        ],
        "name": "cpu new",
        "description": "cpu local demo",
        "field": "AVG(cpu.usage_user)",
        "create": "2023-08-17T10:45:02+08:00",
        "latestupdate": "2023-08-17T11:24:43+08:00",
        "lateststatus": "0",
        "additionalRetrospectiveTime": "",
        "sqlType": "1",
        "sqlCmd": "select cpu, avg(usage_user) from cpu where time >= now() - interval '20' SECOND group by cpu"
    },
    "tenant": "cnosdb"
}

Request Failed

{
    "code": error id, 
    "message": error string, 
    "details":[]
}

/api/v1/alert/config/rule/tenant/:tenant/id/:id

Description

Delete the specified rule.

Request Method

  • DELETE

Request Example

curl -X DELETE http://127.0.0.1:30001/api/v1/alert/config/rule/tenant/cnosdb/id/1

Request Parameters

    :tenant: tenant name
    :id: rule id

Request Succeeded

{
    "message": "succeed"
}

Request Failed

{
    "code": error id, 
    "message": error string, 
    "details":[]
}

/api/v1/alert/config/rule/tenant/:tenant

Description

List all rules for the specified tenant.

Request Method

  • GET

Request Example

curl -X GET 'http://127.0.0.1:30001/api/v1/alert/config/rule/tenant/cnosdb?page=1&per_page=10'

Request Parameters

    :tenant: tenant name
    page: page number
    per_page: number of records displayed per page
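When the list URL is built in a script, the path parameter and the paging parameters are easy to mix up. A minimal Python sketch, assuming the base address used in the request examples; `rule_list_url` is a hypothetical helper name and the default page size of 10 is only taken from the example above.

```python
from urllib.parse import quote, urlencode


def rule_list_url(base: str, tenant: str, page: int = 1, per_page: int = 10) -> str:
    """Build the rule list URL: tenant goes in the path, paging in the query string."""
    query = urlencode({"page": page, "per_page": per_page})
    return f"{base}/api/v1/alert/config/rule/tenant/{quote(tenant, safe='')}?{query}"


url = rule_list_url("http://127.0.0.1:30001", "cnosdb")
print(url)
```

When the URL is passed to curl on a command line, it should be quoted, because an unquoted & is interpreted by the shell.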

Request Succeeded

{
    "data":[
        {
            "name": "cpu new", # rule name
            "severity": "Medium", # rule level
            "lastrun": "2023-08-17T11:51:04+08:00", # last execution time of sql query
            "enabled": "on",  # rule status
            "laststatus": "0", # last execution status, 0 means failure, 1 means success
            "id": 2 # rule id
        }
    ],
    "order": "name, severity, lastrun, laststatus, enabled", # local can be ignored
    "total": "1" # total number of rules under the tenant
}

Request Failed

{
    "code": error id, 
    "message": error string, 
    "details":[]
}

/api/v1/alert/data/alert/tenant/:tenant

Description

List all alert records for the specified tenant.

Request Method

  • GET

Request Example

curl -X GET 'http://127.0.0.1:30001/api/v1/alert/data/alert/tenant/cnosdb?page=1&per_page=10'

Request Parameters

    :tenant: tenant name
    page: page number
    per_page: number of records displayed per page

Request Succeeded

{
    "data": "[{\"enabled\":1,\"name\":\"cpu new\",\"severity\":\"Medium\",\"time\":\"2023-06-27T09:49:08.441665430\",\"value\":\"{\\\"AVG(cpu.usage_user)\\\":0.2001001001000161,\\\"cpu\\\":\\\"cpu2\\\"}\"}]", # alert record in json string
    "order": "time, name, severity, value, enabled",  # local can be ignored
    "total": "628" # alert record total number
}
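In the response above, data is itself a JSON string, and the value field of each record is a JSON string again, so a client has to decode in two stages. The Python sketch below illustrates this under the assumption that the real response body is plain JSON without the # annotations shown above; `decode_alert_records` is a hypothetical helper name and the sample body is constructed locally.

```python
import json


def decode_alert_records(body: str) -> list:
    """Decode the response: `data` is a JSON string, and each `value` is one too."""
    envelope = json.loads(body)
    records = json.loads(envelope["data"])
    for record in records:
        record["value"] = json.loads(record["value"])
    return records


# Build a sample body with the same nesting as the response above.
value = json.dumps({"AVG(cpu.usage_user)": 0.2001, "cpu": "cpu2"})
data = json.dumps([{"enabled": 1, "name": "cpu new", "severity": "Medium", "value": value}])
body = json.dumps({"data": data, "order": "time, name, severity, value, enabled", "total": "628"})

records = decode_alert_records(body)
print(records[0]["value"]["cpu"])
```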

Request Failed

{
    "code": error id, 
    "message": error string, 
    "details":[]
}

/api/v1/alert/data/noti/tenant/:tenant

Description

List all notification records for the specified tenant.

Request Method

  • GET

Request Example

curl -X GET 'http://127.0.0.1:30001/api/v1/alert/data/noti/tenant/cnosdb?page=1&per_page=10'

Request Parameters

    :tenant: tenant name
    page: page number
    per_page: number of records displayed per page

Request Succeeded

{
    "data": "[{\"name\":\"cpu new\",\"send_status\":1,\"severity\":\"Medium\",\"time\":\"2023-06-27T09:27:08\",\"value\":\"{\\\"AVG(cpu.usage_user)\\\":0.20040080160339818,\\\"cpu\\\":\\\"cpu1\\\"}\\n{\\\"AVG(cpu.usage_user)\\\":0.3000000000020009,\\\"cpu\\\":\\\"cpu3\\\"}\\n{\\\"AVG(cpu.usage_user)\\\":0.20026912240306194,\\\"cpu\\\":\\\"cpu-total\\\"}\\n{\\\"AVG(cpu.usage_user)\\\":0.3501002004075616,\\\"cpu\\\":\\\"cpu0\\\"}\\n{\\\"AVG(cpu.usage_user)\\\":0.2999999999974534,\\\"cpu\\\":\\\"cpu2\\\"}\"}]", # notification record in json string
    "order": "time, name, severity, value, send_status", # local can be ignored
    "total": "35" # notification record total number
}
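In the sample above, the value of a notification record holds several JSON objects separated by newlines, one per row that triggered the alarm. A small Python sketch of splitting such a string, assuming that this one-object-per-line layout always holds; `split_notification_value` is a hypothetical helper name.

```python
import json


def split_notification_value(value: str) -> list:
    """Parse a `value` string that holds one JSON object per line."""
    return [json.loads(line) for line in value.splitlines() if line.strip()]


# Two of the rows from the sample response, already unescaped.
sample = (
    '{"AVG(cpu.usage_user)":0.3000000000020009,"cpu":"cpu3"}\n'
    '{"AVG(cpu.usage_user)":0.3501002004075616,"cpu":"cpu0"}'
)
triggered_rows = split_notification_value(sample)
print([row["cpu"] for row in triggered_rows])
```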

Request Failed

{
    "code": error id, 
    "message": error string, 
    "details":[]
}

Example

Suppose we write CPU monitoring data to CnosDB with Telegraf. Part of the table looks like this:

public ❯ select time, cpu, usage_user from cpu order by time desc limit 5;
+---------------------+-----------+---------------------+
| time                | cpu       | usage_user          |
+---------------------+-----------+---------------------+
| 2023-07-04T08:17:50 | cpu0      | 0.0                 |
| 2023-07-04T08:17:50 | cpu1      | 0.6012024047821427  |
| 2023-07-04T08:17:50 | cpu2      | 0.0                 |
| 2023-07-04T08:17:50 | cpu3      | 0.20040080160339818 |
| 2023-07-04T08:17:50 | cpu-total | 0.2503128911078006  |
+---------------------+-----------+---------------------+

This table logs CPU data every 10 seconds. We want to monitor the usage_user value of each CPU in the table and send an alert to Slack when its average over the most recent 20 seconds (the window queried by the rule below) is greater than 0.2.

Create Rule

curl --location 'http://localhost:30001/api/v1/alert/config/rule' \
--header 'Content-Type: application/json' \
--data '{
    "tenant": "cnosdb",
    "data": {
        "enabled": "on",
        "dbname": "public",
        "sqlType": 1,
        "sqlCmd": "select cpu, avg(usage_user) from cpu where time >= now() - interval '\''20'\'' SECOND group by cpu",
        "period": "15s",
        "thresholds": [
            {
                "checks": [
                    {
                        "value": "0.2",
                        "operator": 1
                    }
                ],
                "period": "Automation",
                "severity": "Medium",
                "endtools": [
                    {
                        "name": "slack",
                        "receiver": "https://hooks.slack.com/services/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",
                        "format": "{{dbname}}{{sql}}{{name}}{{period}}{{description}}{{threshold}}",
                        "tool": "slack"
                    }
                ],
                "checkrelation": 0
            }
        ],
        "name": "cpu new",
        "description": "cpu local demo",
        "field": "AVG(cpu.usage_user)",
        "additionalRetrospectiveTime": "5s"
    }
}'
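For readers who script against the API, the same request could be built with the Python standard library. This is a sketch that mirrors the curl command above: `build_create_rule_request` is a hypothetical helper name, the receiver is a placeholder to be replaced with your own Slack webhook URL, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_create_rule_request(base: str, rule: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates an alert rule."""
    return urllib.request.Request(
        url=f"{base}/api/v1/alert/config/rule",
        data=json.dumps(rule).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


rule = {
    "tenant": "cnosdb",
    "data": {
        "enabled": "on",
        "dbname": "public",
        "sqlType": 1,
        "sqlCmd": "select cpu, avg(usage_user) from cpu "
                  "where time >= now() - interval '20' SECOND group by cpu",
        "period": "15s",
        "thresholds": [
            {
                "checks": [{"value": "0.2", "operator": 1}],
                "period": "Automation",
                "severity": "Medium",
                "endtools": [
                    {
                        "name": "slack",
                        # Placeholder: replace with your own webhook URL.
                        "receiver": "https://hooks.slack.com/services/<your-webhook-path>",
                        "format": "{{dbname}}{{sql}}{{name}}{{period}}{{description}}{{threshold}}",
                        "tool": "slack",
                    }
                ],
                "checkrelation": 0,
            }
        ],
        "name": "cpu new",
        "description": "cpu local demo",
        "field": "AVG(cpu.usage_user)",
        "additionalRetrospectiveTime": "5s",
    },
}

request = build_create_rule_request("http://localhost:30001", rule)
print(request.get_method(), request.full_url)
# To send it against a running alert server: urllib.request.urlopen(request)
```

Building the body with json.dumps avoids the shell quoting needed around '20' in the curl version.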

Check for incoming notifications in Slack