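Troubleshooting tools and tips
Kafka queue: consumer group message lag
The logs shown below can help you identify issues with message processing or with other parts of the TBMQ infrastructure.
Because Kafka is used for MQTT message processing and for other major parts of the system (such as client sessions, client subscriptions, retained messages, etc.),
these logs let you analyze the overall state of the broker.
TBMQ can monitor whether messages are being produced to Kafka faster than they are consumed and processed. If they are, the message processing delay will keep growing. To enable this functionality, make sure Kafka consumer-stats are enabled (see the queue.kafka.consumer-stats section of the configuration properties).
Once Kafka consumer-stats are enabled, logs about the offset lag of consumer groups are produced (see Troubleshooting).
Below is an example of such a log message: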
2022-11-27 02:33:23,625 [kafka-consumer-stats-1-thread-1] INFO o.t.m.b.q.k.s.TbKafkaConsumerStatsService-[msg-all-consumer-group] Topic partitions with lag: [[topic=[tbmq.msg.all], partition=[2], lag=[5]]].
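From this message we can see that 5 messages have been pushed to partition 2 of the tbmq.msg.all topic but have not yet been processed.
In general, the log has the following structure: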
TIME [STATS_PRINTING_THREAD_NAME] INFO o.t.m.b.q.k.s.TbKafkaConsumerStatsService-[CONSUMER_GROUP_NAME] Topic partitions with lag: [[topic=[KAFKA_TOPIC], partition=[KAFKA_TOPIC_PARTITION], lag=[LAG]],[topic=[ANOTHER_TOPIC], partition=[], lag=[]],...].
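where:
- CONSUMER_GROUP_NAME - name of the consumer group that is processing messages.
- KAFKA_TOPIC - name of the Kafka topic.
- KAFKA_TOPIC_PARTITION - partition number of the topic.
- LAG - number of unprocessed messages.
Note: logs about consumer lag are printed only if the consumer group actually has a lag.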
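As a minimal sketch of how these lines could be post-processed, the Python snippet below extracts the consumer group and the per-partition lag from a single log line. It assumes the line follows the structure described above exactly; the function name parse_lag_line and both regular expressions are illustrative and not part of TBMQ.

```python
import re

# Matches one "[topic=[...], partition=[...], lag=[...]]" entry.
ENTRY_RE = re.compile(
    r"\[topic=\[([^\]]*)\], partition=\[([^\]]*)\], lag=\[([^\]]*)\]\]"
)
# Captures the consumer group name that follows the logger name.
GROUP_RE = re.compile(r"TbKafkaConsumerStatsService-\[([^\]]+)\]")


def parse_lag_line(line):
    """Return (consumer_group, [(topic, partition, lag), ...]) or None.

    Hypothetical helper: it assumes the log layout shown in this section.
    """
    group = GROUP_RE.search(line)
    if group is None or "Topic partitions with lag:" not in line:
        return None
    entries = [
        (topic, int(partition), int(lag))
        for topic, partition, lag in ENTRY_RE.findall(line)
        if partition and lag  # skip entries whose fields are empty
    ]
    return group.group(1), entries


sample = (
    "2022-11-27 02:33:23,625 [kafka-consumer-stats-1-thread-1] INFO "
    "o.t.m.b.q.k.s.TbKafkaConsumerStatsService-[msg-all-consumer-group] "
    "Topic partitions with lag: "
    "[[topic=[tbmq.msg.all], partition=[2], lag=[5]]]."
)
print(parse_lag_line(sample))
```

Summing the lag values per consumer group over time is one way to check whether the backlog is growing or draining.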
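CPU/Memory usage
Sometimes the problem is that a service does not have enough resources. You can log in to the server/container/pod and run the top command to view CPU and memory usage.
For more convenient monitoring, we recommend configuring Prometheus and Grafana.
If you see that a service's CPU usage regularly reaches 100%, you can either scale horizontally by creating new nodes in the cluster, or scale vertically by increasing the total amount of CPU.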
Logs
Reading logs
Regardless of the deployment type, TBMQ logs are stored in the following directory:
/var/log/thingsboard-mqtt-broker
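Different deployment tools provide different ways to view logs:
- Docker: view the latest logs at runtime. You can use grep to show only the output that contains a given string, for example to check whether there are any errors on the backend side. Tip: you can redirect the logs to a file and then analyze them with any text editor. Note: you can also view the logs from inside the TBMQ container.
- k8s: list all pods in the cluster, view the latest logs of a specific pod, and then view the TBMQ logs. You can use grep to show only the output that contains a given string, for example to check whether there are any errors on the backend side. If you have multiple nodes, you can redirect the logs of every node to files on your machine and analyze them there. Note: you can also view the logs from inside the TBMQ container.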
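Once logs have been redirected to a file, the same kind of filtering can be done in a script. The following is a minimal sketch in Python that behaves roughly like grep with a fixed string; the function name filter_lines and the sample log lines are made up for illustration and are not real TBMQ output.

```python
def filter_lines(lines, needle="ERROR"):
    """Yield every line that contains `needle`, similar to `grep needle`."""
    for line in lines:
        if needle in line:
            yield line.rstrip("\n")


# Hypothetical log lines, used only to demonstrate the filter.
SAMPLE_LINES = [
    "2022-11-27 02:33:23,625 [main] INFO  o.e.SomeService-started\n",
    "2022-11-27 02:33:24,101 [main] ERROR o.e.SomeService-something failed\n",
    "2022-11-27 02:33:25,007 [main] WARN  o.e.SomeService-retrying\n",
]

matches = list(filter_lines(SAMPLE_LINES, "ERROR"))
for match in matches:
    print(match)
```

To scan a saved log, pass an open file object instead of the sample list, for example `with open(path) as f: ...`, where path points to the file you redirected the logs to.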
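Enabling specific logs
To make troubleshooting easier, TBMQ lets users enable or disable logging for specific parts of the system. This is done by modifying the logback.xml file in the following directory: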
/usr/share/thingsboard-mqtt-broker/conf
Note that k8s and Docker deployments use different configuration files.
Below is an example logback.xml configuration:
<!DOCTYPE configuration>
<configuration scan="true" scanPeriod="10 seconds">

    <appender name="fileLogAppender"
              class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>/var/log/thingsboard-mqtt-broker/${TB_SERVICE_ID}/thingsboard-mqtt-broker.log</file>
        <rollingPolicy
                class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>/var/log/thingsboard-mqtt-broker/${TB_SERVICE_ID}/thingsboard-mqtt-broker.%d{yyyy-MM-dd}.%i.log</fileNamePattern>
            <maxFileSize>100MB</maxFileSize>
            <maxHistory>30</maxHistory>
            <totalSizeCap>3GB</totalSizeCap>
        </rollingPolicy>
        <encoder>
            <pattern>%d{ISO8601} [%thread] %-5level %logger{36}-%msg%n</pattern>
        </encoder>
    </appender>

    <logger name="org.thingsboard.mqtt.broker.actors.client.service.connect" level="TRACE"/>
    <logger name="org.thingsboard.mqtt.broker.actors.client.service.disconnect.DisconnectServiceImpl" level="INFO"/>
    <logger name="org.thingsboard.mqtt.broker.actors.DefaultTbActorSystem" level="OFF"/>

    <root level="INFO">
        <appender-ref ref="fileLogAppender"/>
    </root>
</configuration>
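The loggers in the configuration file are the most useful part for troubleshooting, since they enable or disable logging for a specific class or group of classes. In the example above, the default log level is INFO, meaning the logs contain general information, warnings, and errors. For the org.thingsboard.mqtt.broker.actors.client.service.connect package, the most detailed level (TRACE) is enabled. You can also completely disable logging for some part of the system with the OFF level, as is done for org.thingsboard.mqtt.broker.actors.DefaultTbActorSystem.
To enable or disable logging for some part of the system, add the corresponding <logger> configuration and wait up to 10 seconds (the scanPeriod set in the example above).
How you update the logging configuration depends on the deployment tool.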
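Metrics
To enable Prometheus metrics in TBMQ you need to:
- Set the STATS_ENABLED environment variable to true.
- Set the METRICS_ENDPOINTS_EXPOSE environment variable to prometheus in the configuration file.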
After that, the metrics are available at https://<yourhostname>/actuator/prometheus and can be scraped by Prometheus (no authentication is required).
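For a quick manual check of the endpoint, the sketch below downloads the metrics page and parses it into (name, labels, value) tuples using only the Python standard library. It assumes the endpoint returns the standard Prometheus text exposition format. The helper names fetch_metrics and parse_metrics are illustrative, and the metric lines in SAMPLE are made-up examples of the format, not guaranteed TBMQ metric names or label layouts.

```python
import urllib.request


def fetch_metrics(url, timeout=10):
    """Download the raw Prometheus text exposition from the given URL."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_metrics(text):
    """Parse Prometheus text format into a list of (name, labels, value).

    Simplified parser: it does not handle "}" inside label values.
    """
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name, rest = line.split("{", 1)
            labels, tail = rest.rsplit("}", 1)
        else:
            name, _, tail = line.partition(" ")
            labels = ""
        value = float(tail.split()[0])  # ignore an optional timestamp
        samples.append((name, labels, value))
    return samples


# Hypothetical exposition text, used only to demonstrate the parser.
SAMPLE = """\
# HELP example_metric An illustrative metric
# TYPE example_metric gauge
example_metric{statsName="totalMsgs"} 42.0
example_gauge 7
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

Against a running broker, you would call `parse_metrics(fetch_metrics("https://<yourhostname>/actuator/prometheus"))` with your own hostname substituted.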
Prometheus metrics
Spring Actuator in TBMQ can expose some internal state metrics through Prometheus.
Below is the list of metrics that TBMQ exposes to Prometheus:
TBMQ-specific metrics:
- incomingPublishMsg_published (statsNames - totalMsgs, successfulMsgs, failedMsgs): stats about incoming Publish messages to be persisted in the general queue.
- incomingPublishMsg_consumed (statsNames - totalMsgs, successfulMsgs, timeoutMsgs, failedMsgs, tmpTimeout, tmpFailed, successfulIterations, failedIterations): stats about the processing of incoming Publish messages from the general queue.
- deviceProcessor (statsNames - successfulMsgs, failedMsgs, tmpFailed, successfulIterations, failedIterations): stats about DEVICE client message processing. Some stats descriptions:
  - failedMsgs: number of messages that failed to be persisted in the database and were discarded afterwards
  - tmpFailed: number of messages that failed to be persisted in the database and were reprocessed later
- appProcessor (statsNames - successfulPublishMsgs, successfulPubRelMsgs, tmpTimeoutPublish, tmpTimeoutPubRel, timeoutPublishMsgs, timeoutPubRelMsgs, successfulIterations, failedIterations): stats about APPLICATION client message processing. Some stats descriptions:
  - tmpTimeoutPubRel: number of PubRel messages that timed out and were reprocessed later
  - tmpTimeoutPublish: number of Publish messages that timed out and were reprocessed later
  - timeoutPubRelMsgs: number of PubRel messages that timed out and were discarded afterwards
  - timeoutPublishMsgs: number of Publish messages that timed out and were discarded afterwards
  - failedIterations: iterations of processing a pack of messages in which at least one message was not processed successfully
- appProcessor_latency (statsNames - puback, pubrec, pubcomp): stats about APPLICATION processor latency for different message types.
- actors_processing (statsNames - MQTT_CONNECT_MSG, MQTT_PUBLISH_MSG, MQTT_PUBACK_MSG, etc.): stats about the average actor processing time for different message types.
- clientSubscriptionsConsumer (statsNames - totalSubscriptions, acceptedSubscriptions, ignoredSubscriptions): stats about the client subscriptions read from Kafka by the broker node. Some stats descriptions:
  - totalSubscriptions: total number of new subscriptions added to the broker cluster
  - acceptedSubscriptions: number of new subscriptions persisted by the broker node
  - ignoredSubscriptions: number of subscriptions ignored because they were already initially processed by the broker node
- retainedMsgConsumer (statsNames - totalRetainedMsgs, newRetainedMsgs, clearedRetainedMsgs): stats about retained message processing.
- subscriptionLookup: stats about the average time of client subscription lookups in the trie data structure.
- retainedMsgLookup: stats about the average time of retained message lookups in the trie data structure.
- clientSessionsLookup: stats about the average time of client session lookups from the client subscriptions found for a publish message.
- notPersistentMessagesProcessing: stats about the average time to process message delivery for non-persistent clients.
- persistentMessagesProcessing: stats about the average time to process message delivery for persistent clients.
- delivery: stats about the average time of message delivery to clients.
- subscriptionTopicTrieSize: stats about the client subscription count in the trie data structure.
- subscriptionTrieNodes: stats about the client subscription node count in the trie data structure.
- retainMsgTrieSize: stats about the retained message count in the trie data structure.
- retainMsgTrieNodes: stats about the retained message node count in the trie data structure.
- lastWillClients: stats about the count of last will clients.
- connectedSessions: stats about the count of connected sessions.
- connectedSslSessions: stats about the count of sessions connected via TLS.
- allClientSessions: stats about the count of all client sessions.
- clientSubscriptions: stats about the client subscription count in the in-memory map.
- retainedMessages: stats about the retained message count in the in-memory map.
- activeAppProcessors: stats about the count of active APPLICATION processors.
- activeSharedAppProcessors: stats about the count of active APPLICATION processors for shared subscriptions.
- runningActors: stats about the count of running actors.
PostgreSQL-specific metrics:
- sqlQueue_InsertUnauthorizedClientQueue_${index_of_queue} (statsNames - totalMsgs, failedMsgs, successfulMsgs): stats about updating unauthorized clients in the database.
- sqlQueue_DeleteUnauthorizedClientQueue_${index_of_queue} (statsNames - totalMsgs, failedMsgs, successfulMsgs): stats about removing unauthorized clients from the database.
- sqlQueue_LatestTimeseriesQueue_${index_of_queue} (statsNames - totalMsgs, failedMsgs, successfulMsgs): stats about persisting the latest historical stats to the database.
- sqlQueue_TimeseriesQueue_${index_of_queue} (statsNames - totalMsgs, failedMsgs, successfulMsgs): stats about persisting historical stats to the database.
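Note that, to achieve maximum performance, TBMQ uses several queues (threads) for each of the queues listed above, which is why each metric name ends with ${index_of_queue}.
Getting help
If none of the guides above resolves your issue, please contact the ThingsBoard team for further help.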