A not-so-fun error occurred earlier today when my standalone Contrail/Tungsten Fabric Controller host went down. After bringing it back up, most Contrail services were stuck in the initializing state and reported that the Cassandra connection was down.
Running contrail-status to display all services on that node (the output will be different on vRouter nodes) showed the following:
== Contrail control ==
control: initializing (Database:Cassandra connection down)
nodemgr: initializing

== Contrail config-database ==
nodemgr: initializing (Cassandra state detected DOWN. )

== Contrail database ==
nodemgr: initializing

== Contrail analytics ==
snmp-collector: initializing (Database:Cassandra[] connection down)
query-engine: initializing
alarm-gen: initializing (Database:Cassandra[] connection down)
nodemgr: initializing
collector: initializing (Database:Cassandra, Database:contrail-01.ameen.lab:Global connection down)
topology: initializing (Database:Cassandra[] connection down)

== Contrail webui ==

== Contrail config ==
svc-monitor: initializing (Database:Cassandra[] connection down)
nodemgr: initializing
device-manager: initializing (ApiServer:ApiServer[] connection down)
api: initializing (Database:Cassandra[] connection down)
schema: initializing (ApiServer:ApiServer[] connection down)
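With this many services listed, it can help to filter the output down to the ones that are not healthy. The following is a minimal sketch, assuming the line layout shown above ("service: state (details)" under "== section ==" headers); the function name contrail_not_active is my own and not part of Contrail.

```shell
# Hypothetical helper: print every contrail-status line whose state is not
# "active", prefixed with the section it belongs to.
# Reads contrail-status output on stdin.
contrail_not_active() {
    awk '
        /^== .* ==$/   { section = $0; next }   # remember the current section header
        section == ""  { next }                 # ignore anything before the first section (e.g. the pod table)
        NF < 2         { next }                 # skip blank or malformed lines
        $2 != "active" { print section " " $0 } # report every state other than "active"
    '
}
```

It would be used as `contrail-status | contrail_not_active`. Note that it only sees services that are listed at all; a service missing from its section (as cassandra is in the output above) is not reported.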
Some containers also showed an uptime of less than two minutes, even though the controller node itself had been up for almost an hour, which suggested they kept restarting:
Pod              Service    Original Name                 State    Status
config-database  cassandra  contrail-external-cassandra   running  Up 11 seconds
database         cassandra  contrail-external-cassandra   running  Up About a minute
control          nodemgr    contrail-nodemgr              running  Up About a minute
config-database  nodemgr    contrail-nodemgr              running  Up 34 seconds
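The same symptom can be spotted directly from Docker. Below is a rough sketch that reads "name<TAB>status" lines, as produced by `docker ps --format '{{.Names}}\t{{.Status}}'`, and prints the containers that came up within roughly the last minute or are currently restarting. The function name recently_started and the one-minute cut-off are my own choices for illustration.

```shell
# Hypothetical helper: list containers whose status shows a very short uptime.
# stdin: one "name<TAB>status" pair per line.
recently_started() {
    awk -F '\t' '
        # Docker prints short uptimes as "Less than a second", "N seconds"
        # or "About a minute"; a crash-looping container may show "Restarting".
        $2 ~ /^Up (Less than a second|[0-9]+ seconds?|About a minute)/ ||
        $2 ~ /^Restarting/ { print $1 }
    '
}
```

Usage would look like `docker ps --format '{{.Names}}\t{{.Status}}' | recently_started`. Running it twice a few minutes apart and getting the same names back is a reasonable hint that a container is stuck in a restart loop.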
Checking on the Cassandra container revealed the issue:
[root@contrail-01 ~]# docker ps -a | grep config_database_cassandra_1
CONTAINER ID   IMAGE                                                                     COMMAND                  CREATED       STATUS          PORTS   NAMES
df9c3e2e21ea   hub.juniper.net/contrail/contrail-external-cassandra:5.0.2-0.360-queens   "/contrail-entrypoin…"   2 weeks ago   Up 11 seconds           config_database_cassandra_1
[root@contrail-01 ~]# docker logs -f config_database_cassandra_1 --tail 2
ERROR [main] 2019-03-24 09:18:02,078 JVMStabilityInspector.java:102 - Exiting due to error while processing commit log during initialization.
org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: Encountered bad header at position 439651 of commit log /var/lib/cassandra/commitlog/CommitLog-6-1553080898520.log, with invalid CRC. The end of segment marker should be zero.
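The path of the offending segment is buried in a long exception line, so here is a small sketch for pulling it out of the log text. It assumes the segment naming seen above (CommitLog-<version>-<id>.log); the function name bad_commitlog_path is hypothetical.

```shell
# Hypothetical helper: extract commit log segment paths mentioned in
# Cassandra log output read from stdin, one unique path per line.
bad_commitlog_path() {
    # Match an absolute path (no spaces or commas) ending in CommitLog-N-N.log
    grep -o '/[^ ,]*/CommitLog-[0-9]*-[0-9]*\.log' | sort -u
}
```

For example: `docker logs --tail 50 config_database_cassandra_1 2>&1 | bad_commitlog_path`. The `2>&1` is there because docker logs replays the container's stderr on stderr.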
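As the error states, the commit log segment CommitLog-6-1553080898520.log had an invalid CRC, which made Cassandra exit while replaying the commit log at startup and kept the dependent services from initializing. Removing the corrupted file fixed the issue (any writes in that segment that had not yet been flushed to disk are lost with it):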
[root@contrail-01 ~]# docker exec -it config_database_cassandra_1 /bin/bash -c 'rm /var/lib/cassandra/commitlog/CommitLog-6-1553080898520.log'
[root@contrail-01 ~]# docker logs config_database_cassandra_1 --tail 7
INFO [main] 2019-03-24 09:25:01,741 StorageService.java:622 - Native protocol supported versions: 3/v3, 4/v4, 5/v5-beta (default: 4/v4)
INFO [main] 2019-03-24 09:25:01,937 IndexSummaryManager.java:85 - Initializing index summary manager with a memory pool size of 297 MB and a resize interval of 60 minutes
INFO [main] 2019-03-24 09:25:02,029 MessagingService.java:753 - Starting Messaging Service on /192.168.0.50:7012 (eth0)
INFO [main] 2019-03-24 09:25:02,270 StorageService.java:707 - Loading persisted ring state
INFO [main] 2019-03-24 09:25:02,273 StorageService.java:825 - Starting up server gossip
INFO [main] 2019-03-24 09:25:02,520 TokenMetadata.java:479 - Updating topology for /192.168.0.50
INFO [main] 2019-03-24 09:25:02,520 TokenMetadata.java:479 - Updating topology for /192.168.0.50
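A slightly more cautious variant of the same fix is to move the segment aside rather than delete it, so it can still be inspected later. This is only a sketch of that idea, not what I ran; the function name quarantine_commitlog and the quarantine directory are assumptions.

```shell
# Hypothetical helper: move a commit log segment into a quarantine directory
# instead of deleting it. Returns non-zero if the file does not exist.
quarantine_commitlog() {
    src=$1        # path of the corrupted segment
    dest_dir=$2   # directory to keep it in (created if missing)
    [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
    mkdir -p "$dest_dir" && mv -- "$src" "$dest_dir/"
}
```

The function has to run where the file lives, so inside the container it would be simpler to issue the equivalent mkdir and mv through `docker exec`, the same way the rm above was run. The destination should be outside /var/lib/cassandra/commitlog so that Cassandra does not try to replay the segment again.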
Running contrail-status one more time:
== Contrail control ==
control: active
nodemgr: active
named: active
dns: active

== Contrail config-database ==
nodemgr: active
zookeeper: active
rabbitmq: active
cassandra: active

== Contrail database ==
kafka: active
nodemgr: active
zookeeper: active
cassandra: active

== Contrail analytics ==
snmp-collector: active
query-engine: active
api: active
alarm-gen: active
nodemgr: active
collector: active
topology: active

== Contrail webui ==
web: active
job: active

== Contrail config ==
svc-monitor: active
nodemgr: active
device-manager: active
api: active
schema: active
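Services take a little while to settle after Cassandra comes back, so instead of re-running contrail-status by hand, the check can be turned into an exit status and polled. A minimal sketch, assuming the output layout shown above; the function name all_active is my own.

```shell
# Hypothetical helper: succeed (exit 0) only if every listed service under a
# "== section ==" header reports the state "active".
# Reads contrail-status output on stdin.
all_active() {
    awk '
        /^== .* ==$/ { in_sections = 1; next }             # service lines start after the first header
        in_sections && NF >= 2 && $2 != "active" { bad = 1 } # remember any non-active service
        END { exit (bad ? 1 : 0) }
    '
}
```

A polling loop would then be `until contrail-status | all_active; do sleep 10; done`. As with any check based on this output, a service that is not listed at all does not count as a failure.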