Troubleshooting | ThingsBoard PE MQTT Broker

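Troubleshooting tools and tips

Kafka queue: consumer group message lag

You can use the log described in this section to identify issues with message processing or with other parts of the TBMQ infrastructure. Because Kafka is used both for MQTT message processing and for other major parts of the system (such as client sessions, client subscriptions, and retained messages), this log gives a view of the overall state of the broker.

TBMQ can monitor whether messages are produced to Kafka faster than they are consumed and processed. When that happens, message processing delay grows steadily. To enable this monitoring, make sure Kafka consumer-stats are turned on (see the queue.kafka.consumer-stats section of the configuration properties).

Once Kafka consumer-stats are enabled, TBMQ writes log messages about the offset lag of each consumer group (see the Logs section of this guide for how to view them).

Here is an example log message: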

2022-11-27 02:33:23,625 [kafka-consumer-stats-1-thread-1] INFO  o.t.m.b.q.k.s.TbKafkaConsumerStatsService-[msg-all-consumer-group] Topic partitions with lag: [[topic=[tbmq.msg.all], partition=[2], lag=[5]]].

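This message shows that 5 messages have been pushed to partition 2 of the tbmq.msg.all topic but have not yet been processed by the msg-all-consumer-group consumer group.

In general, the log has the following structure: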

TIME [STATS_PRINTING_THREAD_NAME] INFO  o.t.m.b.q.k.s.TbKafkaConsumerStatsService-[CONSUMER_GROUP_NAME] Topic partitions with lag: [[topic=[KAFKA_TOPIC], partition=[KAFKA_TOPIC_PARTITION], lag=[LAG]],[topic=[ANOTHER_TOPIC], partition=[], lag=[]],...].

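where:

CONSUMER_GROUP_NAME - name of the consumer group that is processing messages.
KAFKA_TOPIC - name of the Kafka topic.
KAFKA_TOPIC_PARTITION - number of the topic partition.
LAG - number of unprocessed messages.

Note: logs about consumer lag are printed only when there is lag for that consumer group.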
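The log structure above is regular enough to parse mechanically. The following is a minimal sketch, assuming log lines follow exactly the format shown in the example; the function names parse_lag_line and latest_lag are hypothetical helpers, not part of TBMQ.

```python
import re

# Consumer group name that follows the logger name, e.g.
# "TbKafkaConsumerStatsService-[msg-all-consumer-group]".
GROUP_RE = re.compile(r"TbKafkaConsumerStatsService-\[([^\]]+)\]")
# One "topic=[...], partition=[...], lag=[...]" entry.
ENTRY_RE = re.compile(r"topic=\[([^\]]*)\], partition=\[(\d+)\], lag=\[(\d+)\]")


def parse_lag_line(line):
    """Return (consumer_group, [(topic, partition, lag), ...]) or None."""
    if "Topic partitions with lag" not in line:
        return None
    group = GROUP_RE.search(line)
    if group is None:
        return None
    entries = [(t, int(p), int(l)) for t, p, l in ENTRY_RE.findall(line)]
    return group.group(1), entries


def latest_lag(lines):
    """Keep the most recent lag seen per (group, topic, partition)."""
    latest = {}
    for line in lines:
        parsed = parse_lag_line(line)
        if parsed is None:
            continue
        group, entries = parsed
        for topic, partition, lag in entries:
            # Later log lines overwrite earlier ones: lag is a snapshot.
            latest[(group, topic, partition)] = lag
    return latest
```

Because lag is only logged while it exists, a partition that disappears from later lines has most likely caught up; this sketch does not track that case.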

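CPU/memory usage

Sometimes the problem is that a service does not have enough resources. You can log in to the server/container/pod and run the top command to view CPU and memory usage.

For easier monitoring, we recommend configuring Prometheus and Grafana.

If you see that a service regularly uses 100% of its CPU, either scale it horizontally by creating new nodes in the cluster or scale it vertically by increasing the total amount of CPU.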
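As a rough complement to top, the check can be scripted. This is a minimal sketch, assuming a Unix-like host where os.getloadavg() is available; load average is only a proxy for CPU usage, and inside a container it typically reflects the whole host. The helper names and the threshold of 1.0 are illustrative assumptions.

```python
import os


def cpu_saturation(load_avg, cpu_count):
    """Return load average per CPU (about 1.0 means all CPUs are busy)."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_avg / cpu_count


def needs_scaling(load_avg, cpu_count, threshold=1.0):
    """True when the per-CPU load is at or above the threshold."""
    return cpu_saturation(load_avg, cpu_count) >= threshold


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages.
    one_min, five_min, fifteen_min = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"1-min load per CPU: {cpu_saturation(one_min, cpus):.2f}")
    if needs_scaling(five_min, cpus):
        print("Sustained CPU pressure: consider scaling out or up.")
```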

Logs

Viewing logs

Regardless of the deployment type, TBMQ logs are stored in the following directory:

/var/log/thingsboard-mqtt-broker

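Different deployment tools provide different ways to view logs.

Docker-Compose Deployment

View the latest logs at runtime:

docker compose logs -f tbmq-1 tbmq-2

You can use grep to show only the output that contains a given string. For example, to check whether there are any errors on the backend side:

docker compose logs tbmq-1 tbmq-2 | grep ERROR

Tip: you can redirect the logs to a file and then analyze them with any text editor:

docker compose logs -f tbmq-1 tbmq-2 > tbmq.log

Note: you can also view the logs from inside the TBMQ container:

docker ps
docker exec -it NAME_OF_THE_CONTAINER bash

Kubernetes Deployment

List all pods in the cluster: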

kubectl get pods

View the latest logs of a specific pod:

kubectl logs -f POD_NAME

View the TBMQ logs:

kubectl logs -f tb-broker-0

You can use grep to show only the output that contains a given string. For example, to check whether there are any errors on the backend side:

kubectl logs -f tb-broker-0 | grep ERROR

If you have multiple nodes, you can redirect the logs of each node to files on your machine and then analyze them:

kubectl logs -f tb-broker-0 > tb-broker-0.log
kubectl logs -f tb-broker-1 > tb-broker-1.log

Note: you can also view the logs from inside the TBMQ container:

kubectl exec -it tb-broker-0 -- bash
cat /var/log/thingsboard-mqtt-broker/thingsboard-mqtt-broker.log
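Once logs have been saved to files, grep ERROR gives a flat list; grouping errors by logger often shows where a problem is concentrated. The following is a minimal sketch, assuming lines follow the "[thread] LEVEL logger-message" layout seen in the lag example; count_by_logger is a hypothetical helper, and multi-line stack traces are simply ignored.

```python
import re
import sys
from collections import Counter

# Matches "[thread] LEVEL logger-" anywhere in the line, so prefixes added by
# tools such as docker compose ("tbmq-1  | ") do not get in the way.
LINE_RE = re.compile(
    r"\[(?P<thread>[^\]]+)\]\s+(?P<level>[A-Z]+)\s+(?P<logger>[\w.$]+)-"
)


def count_by_logger(lines, level="ERROR"):
    """Count log lines of the given level, grouped by logger name."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("level") == level:
            counts[match.group("logger")] += 1
    return counts


if __name__ == "__main__":
    # Usage: python count_errors.py tb-broker-0.log tb-broker-1.log
    totals = Counter()
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as handle:
            totals.update(count_by_logger(handle))
    for logger, count in totals.most_common():
        print(f"{count:6d}  {logger}")
```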

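Enabling specific logs

To make troubleshooting easier, TBMQ lets you enable or disable logging for specific parts of the system. You do this by modifying the logback.xml file in the following directory: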

/usr/share/thingsboard-mqtt-broker/conf

Note that Kubernetes and Docker deployments use different configuration files.

Here is an example logback.xml configuration:

<!DOCTYPE configuration>
<configuration scan="true" scanPeriod="10 seconds">

    <appender name="fileLogAppender"
              class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>/var/log/thingsboard-mqtt-broker/${TB_SERVICE_ID}/thingsboard-mqtt-broker.log</file>
        <rollingPolicy
                class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>/var/log/thingsboard-mqtt-broker/${TB_SERVICE_ID}/thingsboard-mqtt-broker.%d{yyyy-MM-dd}.%i.log</fileNamePattern>
            <maxFileSize>100MB</maxFileSize>
            <maxHistory>30</maxHistory>
            <totalSizeCap>3GB</totalSizeCap>
        </rollingPolicy>
        <encoder>
            <pattern>%d{ISO8601} [%thread] %-5level %logger{36}-%msg%n</pattern>
        </encoder>
    </appender>

    <logger name="org.thingsboard.mqtt.broker.actors.client.service.connect" level="TRACE"/>
    <logger name="org.thingsboard.mqtt.broker.actors.client.service.disconnect.DisconnectServiceImpl" level="INFO"/>
    <logger name="org.thingsboard.mqtt.broker.actors.DefaultTbActorSystem" level="OFF"/>

    <root level="INFO">
        <appender-ref ref="fileLogAppender"/>
    </root>
</configuration>

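The loggers are the part of the configuration file that is most useful for troubleshooting: they let you enable or disable logging for a specific class or group of classes. In the example above, the default logging level is INFO, so logs contain general information, warnings, and errors. For the org.thingsboard.mqtt.broker.actors.client.service.connect package, the most detailed level (TRACE) is enabled. You can also disable logs for a part of the system entirely with the OFF level, as is done for org.thingsboard.mqtt.broker.actors.DefaultTbActorSystem.

To enable or disable logging for a part of the system, add the corresponding <logger> element and wait up to 10 seconds (the scanPeriod set in the example configuration).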
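If you change logger levels often, the edit can be scripted. The following is a minimal sketch using Python's standard xml.etree.ElementTree; set_logger_level is a hypothetical helper, and it assumes you accept that ElementTree drops the DOCTYPE line and any XML comments when it rewrites the file.

```python
import xml.etree.ElementTree as ET


def set_logger_level(logback_xml, logger_name, level):
    """Return logback XML text with the given logger added or updated."""
    root = ET.fromstring(logback_xml)
    for logger in root.findall("logger"):
        if logger.get("name") == logger_name:
            # Logger already configured: only change its level.
            logger.set("level", level)
            break
    else:
        new_logger = ET.Element("logger", {"name": logger_name, "level": level})
        new_logger.tail = "\n"
        # Insert before <root> so loggers stay grouped ahead of the root logger.
        children = list(root)
        index = next(
            (i for i, child in enumerate(children) if child.tag == "root"),
            len(children),
        )
        root.insert(index, new_logger)
    return ET.tostring(root, encoding="unicode")
```

For example, set_logger_level(text, "org.thingsboard.mqtt.broker.actors.client.service.connect", "INFO") would lower the verbosity of the TRACE logger from the example configuration.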

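Different deployment tools have different ways to update the logging configuration.

Docker-Compose Deployment

For a docker-compose deployment, the /config directory is mapped to the local ./tb-mqtt-broker/conf directory. To change the logging configuration, edit the ./tb-mqtt-broker/conf/logback.xml file.

Kubernetes Deployment

A Kubernetes deployment uses a ConfigMap to provide the logback configuration to tb-brokers. To change logback.xml, edit tb-broker-configmap.yml and run: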

kubectl apply -f tb-broker-configmap.yml

After about 10 seconds, the new logging configuration takes effect.

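Metrics

To enable Prometheus metrics in TBMQ:

Set the STATS_ENABLED environment variable to true.
Set the METRICS_ENDPOINTS_EXPOSE environment variable to prometheus in the configuration file.

The metrics are then available at https://<yourhostname>/actuator/prometheus and can be scraped by Prometheus. No authentication is required.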
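To check the endpoint without a full Prometheus setup, you can fetch and parse it directly. The following is a minimal sketch, assuming the endpoint returns the standard Prometheus text exposition format; parse_metrics and fetch_metrics are hypothetical helpers, and the simple regular expression does not handle label values that contain a closing brace.

```python
import re
import urllib.request

# "<name>{<labels>} <value>" with the label block being optional.
SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)")


def parse_metrics(text):
    """Parse Prometheus text exposition into {(name, labels): float}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, labels, value = match.group(1), match.group(2) or "", match.group(3)
        samples[(name, labels)] = float(value)
    return samples


def fetch_metrics(url, timeout=5.0):
    """Download the metrics page at url and parse it."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Calling fetch_metrics with the /actuator/prometheus URL of your own host returns a dictionary you can filter by metric name; the exact names and labels exposed depend on how TBMQ registers them and are not assumed here.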

Prometheus metrics

Spring Actuator in TBMQ can expose some internal state metrics through Prometheus.

Here is the list of metrics that TBMQ exposes to Prometheus:

TBMQ-specific metrics:

- incomingPublishMsg_published (statsNames - totalMsgs, successfulMsgs, failedMsgs): stats about incoming Publish messages to be persisted in the general queue.
- incomingPublishMsg_consumed (statsNames - totalMsgs, successfulMsgs, timeoutMsgs, failedMsgs, tmpTimeout, tmpFailed, successfulIterations, failedIterations): stats about processing of incoming Publish messages from the general queue.
- deviceProcessor (statsNames - successfulMsgs, failedMsgs, tmpFailed, successfulIterations, failedIterations): stats about processing of DEVICE client messages. Some stats descriptions:
    - failedMsgs: number of messages that failed to be persisted in the database and were discarded afterwards
    - tmpFailed: number of messages that failed to be persisted in the database and were reprocessed later
- appProcessor (statsNames - successfulPublishMsgs, successfulPubRelMsgs, tmpTimeoutPublish, tmpTimeoutPubRel, timeoutPublishMsgs, timeoutPubRelMsgs, successfulIterations, failedIterations): stats about processing of APPLICATION client messages. Some stats descriptions:
    - tmpTimeoutPubRel: number of PubRel messages that timed out and were reprocessed later
    - tmpTimeoutPublish: number of Publish messages that timed out and were reprocessed later
    - timeoutPubRelMsgs: number of PubRel messages that timed out and were discarded afterwards
    - timeoutPublishMsgs: number of Publish messages that timed out and were discarded afterwards
    - failedIterations: iterations of processing a message pack in which at least one message wasn’t processed successfully
- appProcessor_latency (statsNames - puback, pubrec, pubcomp): stats about APPLICATION processor latency for different message types.
- actors_processing (statsNames - MQTT_CONNECT_MSG, MQTT_PUBLISH_MSG, MQTT_PUBACK_MSG, etc.): stats about the average time actors take to process different message types.
- clientSubscriptionsConsumer (statsNames - totalSubscriptions, acceptedSubscriptions, ignoredSubscriptions): stats about the client subscriptions read from Kafka by the broker node. Some stats descriptions:
    - totalSubscriptions: total number of new subscriptions added to the broker cluster
    - acceptedSubscriptions: number of new subscriptions persisted by the broker node
    - ignoredSubscriptions: number of subscriptions ignored because they had already been processed by the broker node
- retainedMsgConsumer (statsNames - totalRetainedMsgs, newRetainedMsgs, clearedRetainedMsgs): stats about retained message processing.
- subscriptionLookup: stats about the average time of client subscription lookup in the trie data structure.
- retainedMsgLookup: stats about the average time of retained message lookup in the trie data structure.
- clientSessionsLookup: stats about the average time of client session lookup from the client subscriptions found for a publish message.
- notPersistentMessagesProcessing: stats about the average time to process message delivery for non-persistent clients.
- persistentMessagesProcessing: stats about the average time to process message delivery for persistent clients.
- delivery: stats about the average time of message delivery to clients.
- subscriptionTopicTrieSize: stats about the client subscription count in the trie data structure.
- subscriptionTrieNodes: stats about the client subscription node count in the trie data structure.
- retainMsgTrieSize: stats about the retained message count in the trie data structure.
- retainMsgTrieNodes: stats about the retained message node count in the trie data structure.
- lastWillClients: stats about the last will client count.
- connectedSessions: stats about the connected session count.
- connectedSslSessions: stats about the count of sessions connected via TLS.
- allClientSessions: stats about the count of all client sessions.
- clientSubscriptions: stats about the client subscription count in the in-memory map.
- retainedMessages: stats about the retained message count in the in-memory map.
- activeAppProcessors: stats about the active APPLICATION processor count.
- activeSharedAppProcessors: stats about the active APPLICATION processor count for shared subscriptions.
- runningActors: stats about the running actor count.

PostgreSQL-specific metrics:

- sqlQueue_InsertUnauthorizedClientQueue_${index_of_queue} (statsNames - totalMsgs, failedMsgs, successfulMsgs): stats about writing unauthorized clients to the database.
- sqlQueue_DeleteUnauthorizedClientQueue_${index_of_queue} (statsNames - totalMsgs, failedMsgs, successfulMsgs): stats about removing unauthorized clients from the database.
- sqlQueue_LatestTimeseriesQueue_${index_of_queue} (statsNames - totalMsgs, failedMsgs, successfulMsgs): stats about persisting the latest historical stats to the database.
- sqlQueue_TimeseriesQueue_${index_of_queue} (statsNames - totalMsgs, failedMsgs, successfulMsgs): stats about persisting historical stats to the database.

Note that, to achieve the best performance, TBMQ uses several queues (threads) for each of the queues listed above, which is why each metric name carries a queue index.

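Getting help

Slack community

Ask other users and contributors questions and get quick answers.

GitHub Issues

Found a bug or have a question? Open an issue in our GitHub repository.

If none of the guides above resolves your problem, contact the ThingsBoard team for further help.

Contact us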