Troubleshooting

Troubleshooting tools and tips

Kafka queue: consumer group message lag

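The log shown below can be used to identify issues with message processing or with other parts of the TBMQ infrastructure. Since Kafka is used for MQTT message processing and for other major parts of the system (such as client sessions, client subscriptions, retained messages, etc.), it lets you analyze the overall state of the broker.

TBMQ can monitor whether messages are produced to Kafka faster than they are consumed and processed. If they are, the message processing delay grows over time. To enable this functionality, make sure Kafka consumer-stats are enabled (see the queue.kafka.consumer-stats section of the configuration properties).

Once Kafka consumer-stats are enabled, logs about the offset lag of consumer groups are generated (see the Logs section for how to view them).

Here is an example of the log message: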

2022-11-27 02:33:23,625 [kafka-consumer-stats-1-thread-1] INFO  o.t.m.b.q.k.s.TbKafkaConsumerStatsService-[msg-all-consumer-group] Topic partitions with lag: [[topic=[tbmq.msg.all], partition=[2], lag=[5]]].

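From this message we can see that 5 messages have been pushed to the tbmq.msg.all topic but have not been processed yet.

In general, the log has the following structure: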

TIME [STATS_PRINTING_THREAD_NAME] INFO  o.t.m.b.q.k.s.TbKafkaConsumerStatsService-[CONSUMER_GROUP_NAME] Topic partitions with lag: [[topic=[KAFKA_TOPIC], partition=[KAFKA_TOPIC_PARTITION], lag=[LAG]],[topic=[ANOTHER_TOPIC], partition=[], lag=[]],...].

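Where:

  • CONSUMER_GROUP_NAME - name of the consumer group that is processing messages.
  • KAFKA_TOPIC - name of the Kafka topic.
  • KAFKA_TOPIC_PARTITION - number of the topic partition.
  • LAG - number of unprocessed messages.

Note: logs about consumer lag are printed only if there is a lag for this consumer group.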
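
The log structure described above can be parsed mechanically, which helps when lag has to be tracked over time rather than read by eye. The following is a minimal Python sketch, not part of TBMQ: the function name parse_lag_line is hypothetical, and the regular expressions assume the exact layout of the sample line shown earlier.

```python
import re

# One "[topic=[...], partition=[...], lag=[...]]" entry of the lag log.
ENTRY_RE = re.compile(r"topic=\[([^\]]*)\], partition=\[(\d+)\], lag=\[(\d+)\]")
# The consumer group name printed right after the logger name.
GROUP_RE = re.compile(r"TbKafkaConsumerStatsService-\[([^\]]+)\]")


def parse_lag_line(line):
    """Return (consumer_group, [(topic, partition, lag), ...]) or None.

    Hypothetical helper: it assumes the layout of the sample log line
    from this guide and returns None for any other line.
    """
    group = GROUP_RE.search(line)
    if group is None or "Topic partitions with lag:" not in line:
        return None
    entries = [(t, int(p), int(l)) for t, p, l in ENTRY_RE.findall(line)]
    return group.group(1), entries


sample = (
    "2022-11-27 02:33:23,625 [kafka-consumer-stats-1-thread-1] INFO  "
    "o.t.m.b.q.k.s.TbKafkaConsumerStatsService-[msg-all-consumer-group] "
    "Topic partitions with lag: "
    "[[topic=[tbmq.msg.all], partition=[2], lag=[5]]]."
)
print(parse_lag_line(sample))
# prints ('msg-all-consumer-group', [('tbmq.msg.all', 2, 5)])
```

Collecting these tuples from successive log lines makes it easy to see whether the lag of a given partition keeps growing or drains again.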

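CPU/Memory usage

Sometimes the problem is that a service does not have enough resources. You can check CPU and memory usage by logging into the server/container/pod and running the top command.

For more convenient monitoring, consider configuring Prometheus and Grafana.

If a service regularly uses 100% of its CPU, either scale it horizontally by adding new nodes to the cluster or scale it vertically by increasing the total amount of CPU.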

Logs

Reading logs

Regardless of the deployment type, TBMQ logs are stored in the following directory:

/var/log/thingsboard-mqtt-broker

Different deployment tools provide different ways to view logs:

With Docker Compose, view the latest logs at runtime:

docker compose logs -f tbmq-1 tbmq-2

If you still use the hyphenated docker-compose command, run:

docker-compose logs -f tbmq-1 tbmq-2

You can use grep to show only the output that contains a given string. For example, to check whether there are any errors on the backend side:

docker compose logs tbmq-1 tbmq-2 | grep ERROR

If you still use the hyphenated docker-compose command, run:

docker-compose logs tbmq-1 tbmq-2 | grep ERROR

Tip: you can redirect the logs to a file and then analyze them with any text editor:

docker compose logs -f tbmq-1 tbmq-2 > tbmq.log

If you still use the hyphenated docker-compose command, run:

docker-compose logs -f tbmq-1 tbmq-2 > tbmq.log

Note: you can also view the logs from inside the TBMQ container:

docker ps
docker exec -it NAME_OF_THE_CONTAINER bash

With Kubernetes, list all pods in the cluster:

kubectl get pods

View the latest logs of a specific pod:

kubectl logs -f POD_NAME

View the TBMQ logs:

kubectl logs -f tb-broker-0

You can use grep to show only the output that contains a given string. For example, to check whether there are any errors on the backend side:

kubectl logs -f tb-broker-0 | grep ERROR

If you have multiple nodes, you can redirect the logs of each node to files on your machine and then analyze them:

kubectl logs -f tb-broker-0 > tb-broker-0.log
kubectl logs -f tb-broker-1 > tb-broker-1.log

Note: you can also view the logs from inside the TBMQ container:

kubectl exec -it tb-broker-0 -- bash
cat /var/log/thingsboard-mqtt-broker/thingsboard-mqtt-broker.log
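
Once the logs are saved to a file, it can be more informative to count errors per logger than to scroll through raw grep output. The following is a minimal Python sketch under stated assumptions: the helper name count_by_logger is hypothetical, the line layout "TIME [thread] LEVEL logger-message" is inferred from the sample lag log line earlier in this guide, and the optional "service | " prefix is the one that docker compose logs adds to each line.

```python
import re
from collections import Counter

# Assumed layout: "DATE TIME [thread] LEVEL logger-message", optionally
# preceded by the "service | " prefix added by "docker compose logs".
LINE_RE = re.compile(
    r"^(?:\S+\s+\| )?\S+ \S+ \[.+?\] (?P<level>\w+)\s+(?P<logger>[\w.$]+)-"
)


def count_by_logger(lines, level="ERROR"):
    """Count log lines of the given level, grouped by logger name.

    Lines that do not match the assumed layout (for example stack trace
    continuation lines) are skipped.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("level") == level:
            counts[match.group("logger")] += 1
    return counts


# Usage with a file saved by one of the redirects above, for example:
#   with open("tb-broker-0.log", encoding="utf-8") as log_file:
#       for logger, count in count_by_logger(log_file).most_common(10):
#           print(count, logger)
```

Grouping by logger points at the part of the broker that fails most often, which is also a good candidate for a more detailed logging level.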

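Enable certain logs

To make troubleshooting easier, TBMQ lets you enable or disable logging for specific parts of the system. This is done by modifying the logback.xml file located in the following directory: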

/usr/share/thingsboard-mqtt-broker/conf

Note that Kubernetes (k8s) and Docker deployments use separate configuration files.

Here is an example of the logback.xml configuration:

<!DOCTYPE configuration>
<configuration scan="true" scanPeriod="10 seconds">

    <appender name="fileLogAppender"
              class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>/var/log/thingsboard-mqtt-broker/${TB_SERVICE_ID}/thingsboard-mqtt-broker.log</file>
        <rollingPolicy
                class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>/var/log/thingsboard-mqtt-broker/${TB_SERVICE_ID}/thingsboard-mqtt-broker.%d{yyyy-MM-dd}.%i.log</fileNamePattern>
            <maxFileSize>100MB</maxFileSize>
            <maxHistory>30</maxHistory>
            <totalSizeCap>3GB</totalSizeCap>
        </rollingPolicy>
        <encoder>
            <pattern>%d{ISO8601} [%thread] %-5level %logger{36}-%msg%n</pattern>
        </encoder>
    </appender>

    <logger name="org.thingsboard.mqtt.broker.actors.client.service.connect" level="TRACE"/>
    <logger name="org.thingsboard.mqtt.broker.actors.client.service.disconnect.DisconnectServiceImpl" level="INFO"/>
    <logger name="org.thingsboard.mqtt.broker.actors.DefaultTbActorSystem" level="OFF"/>

    <root level="INFO">
        <appender-ref ref="fileLogAppender"/>
    </root>
</configuration>

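The loggers are the most useful part of the configuration file for troubleshooting: they let you enable or disable logging for a specific class or group of classes. In the example above, the default logging level is INFO, so the logs contain general information, warnings, and errors. For the org.thingsboard.mqtt.broker.actors.client.service.connect package, the most detailed level (TRACE) is enabled. You can also disable logs for some part of the system entirely with the OFF level, as is done for org.thingsboard.mqtt.broker.actors.DefaultTbActorSystem.

To enable or disable logging for some part of the system, add the corresponding <logger> element and wait up to 10 seconds (the example above sets scanPeriod to "10 seconds").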
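
When the same logger change has to be applied to several configuration files, the edit can be scripted. The following is a minimal Python sketch, not a TBMQ tool: the function add_logger and the package name org.example.demo are hypothetical, and the code assumes the file contains a <root ...> element like the one in the example above. It works on the text directly so that the DOCTYPE line and existing formatting stay untouched.

```python
from xml.sax.saxutils import quoteattr

# Level names accepted by logback.
LEVELS = {"ALL", "TRACE", "DEBUG", "INFO", "WARN", "ERROR", "OFF"}


def add_logger(logback_xml, name, level):
    """Return the config text with a <logger> inserted before <root>.

    Hypothetical helper: it does not check whether a logger with the
    same name already exists.
    """
    level = level.upper()
    if level not in LEVELS:
        raise ValueError(f"unsupported level: {level}")
    idx = logback_xml.find("<root ")
    if idx == -1:
        raise ValueError("no <root> element found")
    # Reuse the whitespace in front of <root> so the new line is aligned.
    line_start = logback_xml.rfind("\n", 0, idx) + 1
    indent = logback_xml[line_start:idx]
    if indent.strip():
        indent = ""
    element = f"<logger name={quoteattr(name)} level={quoteattr(level)}/>\n{indent}"
    return logback_xml[:idx] + element + logback_xml[idx:]


config = '<configuration>\n    <root level="INFO">\n    </root>\n</configuration>\n'
print(add_logger(config, "org.example.demo", "debug"))
```

After the modified text is written back to logback.xml, the change is picked up on the next configuration scan.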

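Different deployment tools have different ways to update the logging configuration:

For a docker-compose deployment, the /config directory is mapped to the local ./tb-mqtt-broker/conf directory. To change the logging configuration, edit the ./tb-mqtt-broker/conf/logback.xml file.

A Kubernetes deployment uses a ConfigMap to provide the logback configuration to the tb-broker pods. To update logback.xml, edit tb-broker-configmap.yml and run: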

kubectl apply -f tb-broker-configmap.yml

After about 10 seconds, the new logging configuration takes effect.

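Metrics

To enable Prometheus metrics in TBMQ you need to:

  • set the STATS_ENABLED environment variable to true;
  • set the METRICS_ENDPOINTS_EXPOSE environment variable to prometheus in the configuration file.

The metrics are then available at https://<yourhostname>/actuator/prometheus and can be scraped by Prometheus (no authentication is required).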
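
Before wiring up Prometheus, it can be useful to confirm that the endpoint actually returns samples. The following is a minimal Python sketch using only the standard library. The function names and the metric names in the sample text are hypothetical; only the endpoint path comes from this guide, and the parser assumes the standard Prometheus text exposition format.

```python
import re
import urllib.request

# "name{labels} value" or "name value" in the Prometheus text format.
SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)")


def parse_metrics(text):
    """Map 'name{labels}' to a float for every sample line.

    Comment lines (# HELP, # TYPE) and blank lines are skipped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match:
            key = match.group(1) + (match.group(2) or "")
            samples[key] = float(match.group(3))
    return samples


def fetch_metrics(url):
    """Download and parse the metrics page, e.g. .../actuator/prometheus."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Hypothetical sample text; real TBMQ metric names may differ.
sample_text = """\
# HELP demo_total A hypothetical counter
# TYPE demo_total counter
demo_total{statsName="totalMsgs"} 42.0
demo_gauge 7
"""
print(parse_metrics(sample_text))
```

An empty result from fetch_metrics usually means that the endpoint is reachable but the two environment variables above are not set as described.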

Prometheus metrics

Spring Actuator in TBMQ can expose some internal state metrics via Prometheus.

Here is the list of metrics that TBMQ exposes to Prometheus:

TBMQ-specific metrics:

  • incomingPublishMsg_published (statsNames - totalMsgs, successfulMsgs, failedMsgs): stats about incoming Publish messages to be persisted in the general queue.
  • incomingPublishMsg_consumed (statsNames - totalMsgs, successfulMsgs, timeoutMsgs, failedMsgs, tmpTimeout, tmpFailed, successfulIterations, failedIterations): stats about processing of incoming Publish messages from the general queue.
  • deviceProcessor (statsNames - successfulMsgs, failedMsgs, tmpFailed, successfulIterations, failedIterations): stats about DEVICE client messages processing. Some stats descriptions:
      ◦ failedMsgs: number of messages that failed to be persisted in the database and were discarded afterwards
      ◦ tmpFailed: number of messages that failed to be persisted in the database and were reprocessed later
  • appProcessor (statsNames - successfulPublishMsgs, successfulPubRelMsgs, tmpTimeoutPublish, tmpTimeoutPubRel, timeoutPublishMsgs, timeoutPubRelMsgs, successfulIterations, failedIterations): stats about APPLICATION client messages processing. Some stats descriptions:
      ◦ tmpTimeoutPubRel: number of PubRel messages that timed out and were reprocessed later
      ◦ tmpTimeoutPublish: number of Publish messages that timed out and were reprocessed later
      ◦ timeoutPubRelMsgs: number of PubRel messages that timed out and were discarded afterwards
      ◦ timeoutPublishMsgs: number of Publish messages that timed out and were discarded afterwards
      ◦ failedIterations: iterations of processing a pack of messages in which at least one message was not processed successfully
  • appProcessor_latency (statsNames - puback, pubrec, pubcomp): stats about APPLICATION processor latency for different message types.
  • actors_processing (statsNames - MQTT_CONNECT_MSG, MQTT_PUBLISH_MSG, MQTT_PUBACK_MSG, etc.): stats about the average time actors take to process different message types.
  • clientSubscriptionsConsumer (statsNames - totalSubscriptions, acceptedSubscriptions, ignoredSubscriptions): stats about the client subscriptions read from Kafka by the broker node. Some stats descriptions:
      ◦ totalSubscriptions: total number of new subscriptions added to the broker cluster
      ◦ acceptedSubscriptions: number of new subscriptions persisted by the broker node
      ◦ ignoredSubscriptions: number of subscriptions ignored because they were already initially processed by the broker node
  • retainedMsgConsumer (statsNames - totalRetainedMsgs, newRetainedMsgs, clearedRetainedMsgs): stats about retained messages processing.
  • subscriptionLookup: stats about the average time of client subscriptions lookup in the trie data structure.
  • retainedMsgLookup: stats about the average time of retained messages lookup in the trie data structure.
  • clientSessionsLookup: stats about the average time of client sessions lookup from the client subscriptions found for a publish message.
  • notPersistentMessagesProcessing: stats about the average time of processing message delivery for non-persistent clients.
  • persistentMessagesProcessing: stats about the average time of processing message delivery for persistent clients.
  • delivery: stats about the average time of message delivery to clients.
  • subscriptionTopicTrieSize: stats about the client subscriptions count in the trie data structure.
  • subscriptionTrieNodes: stats about the client subscriptions nodes count in the trie data structure.
  • retainMsgTrieSize: stats about the retained messages count in the trie data structure.
  • retainMsgTrieNodes: stats about the retained messages nodes count in the trie data structure.
  • lastWillClients: stats about the last will clients count.
  • connectedSessions: stats about the connected sessions count.
  • connectedSslSessions: stats about the count of sessions connected via TLS.
  • allClientSessions: stats about the count of all client sessions.
  • clientSubscriptions: stats about the client subscriptions count in the in-memory map.
  • retainedMessages: stats about the retained messages count in the in-memory map.
  • activeAppProcessors: stats about the active APPLICATION processors count.
  • activeSharedAppProcessors: stats about the active APPLICATION processors count for shared subscriptions.
  • runningActors: stats about the running actors count.

PostgreSQL-specific metrics:

  • sqlQueue_InsertUnauthorizedClientQueue_${index_of_queue} (statsNames - totalMsgs, failedMsgs, successfulMsgs): stats about updating unauthorized clients in the database.
  • sqlQueue_DeleteUnauthorizedClientQueue_${index_of_queue} (statsNames - totalMsgs, failedMsgs, successfulMsgs): stats about removing unauthorized clients from the database.
  • sqlQueue_LatestTimeseriesQueue_${index_of_queue} (statsNames - totalMsgs, failedMsgs, successfulMsgs): stats about persisting the latest historical stats to the database.
  • sqlQueue_TimeseriesQueue_${index_of_queue} (statsNames - totalMsgs, failedMsgs, successfulMsgs): stats about persisting historical stats to the database.

Note that, to achieve maximum performance, TBMQ uses several queues (threads) for each of the queues listed above.
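
Because each SQL queue is split into several indexed queues, a per-queue total is often easier to read than the individual values. The following is a minimal Python sketch: the helper sum_across_indexes is hypothetical, and it assumes the values of one stat (for example totalMsgs) were already collected into a dictionary keyed by names of the form sqlQueue_NAME_INDEX, which may differ from the names that the Prometheus endpoint actually exposes.

```python
import re
from collections import defaultdict

# "sqlQueue_<QueueName>_<index>", following the naming in the list above.
INDEXED_RE = re.compile(r"^(sqlQueue_\w+?)_(\d+)$")


def sum_across_indexes(values):
    """Sum values of indexed SQL queues under their common queue name.

    Names that do not look like an indexed SQL queue are kept as they are.
    """
    totals = defaultdict(float)
    for name, value in values.items():
        match = INDEXED_RE.match(name)
        key = match.group(1) if match else name
        totals[key] += value
    return dict(totals)


# Hypothetical values of a single stat for two indexed queues.
print(sum_across_indexes({
    "sqlQueue_TimeseriesQueue_0": 3,
    "sqlQueue_TimeseriesQueue_1": 4,
}))
# prints {'sqlQueue_TimeseriesQueue': 7.0}
```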

Getting help

If none of the guides above resolves your issue, please contact the ThingsBoard team for further help.

Contact us