28Mar

Resolving config_database_cassandra Container Restart Loop

A not-so-fun error occurred earlier today when my standalone Contrail/Tungsten Fabric controller host went down. After I brought it back up, Cassandra would not stay up, and the services that depend on it were stuck initializing.

Note: a standalone controller is not a recommended design, given the nature of the components that run on it and how vRouters connect to controllers. This post describes a PoC setup.

The contrail-status command displays all services on that node (the output differs on vRouter nodes):

== Contrail control ==
control: initializing (Database:Cassandra connection down)
nodemgr: initializing

== Contrail config-database ==
nodemgr: initializing (Cassandra state detected DOWN. )

== Contrail database ==
nodemgr: initializing

== Contrail analytics ==
snmp-collector: initializing (Database:Cassandra[] connection down)
query-engine: initializing
alarm-gen: initializing (Database:Cassandra[] connection down)
nodemgr: initializing
collector: initializing (Database:Cassandra, Database:contrail-01.ameen.lab:Global connection down)
topology: initializing (Database:Cassandra[] connection down)

== Contrail webui ==

== Contrail config ==
svc-monitor: initializing (Database:Cassandra[] connection down)
nodemgr: initializing
device-manager: initializing (ApiServer:ApiServer[] connection down)
api: initializing (Database:Cassandra[] connection down)
schema: initializing (ApiServer:ApiServer[] connection down)
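
With this many lines it helps to filter the output down to the services that are not healthy. The following is only a rough sketch, assuming the output keeps the "== Contrail <section> ==" and "<service>: <state>" layout shown above; not_active is a helper name I made up, not part of Contrail:

```shell
# List every service that contrail-status does not report as "active".
# Reads contrail-status output on stdin and prints "section/service: state".
# Assumes the "== Contrail <section> ==" / "<service>: <state> ..." layout.
not_active() {
    awk '
        /^== .* ==$/ {                  # section header, e.g. "== Contrail control =="
            section = $0
            sub(/^== Contrail /, "", section)
            sub(/ ==$/, "", section)
            next
        }
        /^[A-Za-z0-9_-]+: / {           # service line, e.g. "api: initializing (...)"
            name = $1
            sub(/:$/, "", name)
            if ($2 != "active") print section "/" name ": " $2
        }
    '
}
```

It would be used as `contrail-status | not_active`, and it prints nothing once everything is active.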

Some containers also reported an uptime of under two minutes even though the controller node itself had been up for almost an hour, a sign that they were restarting repeatedly:

Pod              Service         Original Name                          State    Status      
config-database  cassandra       contrail-external-cassandra            running  Up 11 seconds  
database         cassandra       contrail-external-cassandra            running  Up About a minute  
control          nodemgr         contrail-nodemgr                       running  Up About a minute 
config-database  nodemgr         contrail-nodemgr                       running  Up 34 seconds 
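
The same check can be made against Docker directly. This is a minimal sketch, assuming Docker's usual human-readable status strings ("Up 11 seconds", "Up About a minute", "Restarting ..."); recently_started is a hypothetical helper name, and the two-minute cutoff is simply what looked suspicious on my node:

```shell
# Print the names of containers that started within roughly the last two
# minutes or are currently restarting.
# Input on stdin: one "<name><TAB><status>" line per container.
recently_started() {
    awk -F '\t' '
        $2 ~ /^(Restarting|Up (Less than a second|[0-9]+ seconds?|About a minute))/ {
            print $1
        }
    '
}
```

Something like `docker ps --format '{{.Names}}\t{{.Status}}' | recently_started` should feed it the right input. For a single container, `docker inspect --format '{{.RestartCount}}' config_database_cassandra_1` reports how many times Docker has restarted it, which is a more direct signal when a restart policy is in place.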

Checking on the Cassandra container revealed the issue:

[root@contrail-01 ~]# docker ps -a | grep config_database_cassandra_1
CONTAINER ID        IMAGE                                                                               COMMAND                  CREATED             STATUS                   PORTS               NAMES
df9c3e2e21ea        hub.juniper.net/contrail/contrail-external-cassandra:5.0.2-0.360-queens             "/contrail-entrypoin…"   2 weeks ago         Up 11 seconds                                   config_database_cassandra_1

[root@contrail-01 ~]# docker logs -f config_database_cassandra_1 --tail 2

ERROR [main] 2019-03-24 09:18:02,078 JVMStabilityInspector.java:102 - Exiting due to error while processing commit log during initialization.
org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: Encountered bad header at position 439651 of commit log /var/lib/cassandra/commitlog/CommitLog-6-1553080898520.log, with invalid CRC. The end of segment marker should be zero.
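
If the log is long, the path of the offending segment can be pulled out of it rather than read by eye. This is a sketch only, assuming the message format shown above; corrupt_commitlogs is a name I picked for illustration:

```shell
# Print each commit log path named in a CommitLogReadException message.
# Reads Cassandra log text on stdin; prints unique paths, one per line.
corrupt_commitlogs() {
    grep 'CommitLogReadException' |
        grep -o '/[^ ,]*/commitlog/CommitLog-[0-9-]*\.log' |
        sort -u
}
```

For example: `docker logs config_database_cassandra_1 --tail 50 2>&1 | corrupt_commitlogs`. The `2>&1` is there because docker logs replays the container's stderr on stderr.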

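As the error states, the commit log segment CommitLog-6-1553080898520.log had an invalid CRC, so Cassandra exited while replaying its commit log on every start and the container kept restarting. Removing the corrupted file fixed the issue (note that any writes recorded only in that segment are lost):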

[root@contrail-01 ~]# docker exec -it config_database_cassandra_1 /bin/bash -c 'rm /var/lib/cassandra/commitlog/CommitLog-6-1553080898520.log'
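
A more cautious variant is to move the segment aside instead of deleting it, so it can still be inspected later. This is a minimal sketch rather than what I actually ran; quarantine_commitlog and the destination directory are hypothetical names:

```shell
# Move a corrupt commit log segment into a holding directory instead of
# deleting it.
#   $1: path of the corrupt segment
#   $2: directory to move it into (created if missing; hypothetical location)
quarantine_commitlog() {
    local segment="$1" dest="$2"
    [ -f "$segment" ] || { echo "no such file: $segment" >&2; return 1; }
    mkdir -p "$dest" && mv -- "$segment" "$dest/"
}
```

Inside the container the same idea is a one-liner passed to `docker exec ... /bin/bash -c`, using `mkdir -p` and `mv` in place of `rm`. Whether the moved file survives a container re-creation depends on how /var/lib/cassandra is mounted, which I have not verified here.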

[root@contrail-01 ~]# docker logs config_database_cassandra_1 --tail 7
INFO [main] 2019-03-24 09:25:01,741 StorageService.java:622 - Native protocol supported versions: 3/v3, 4/v4, 5/v5-beta (default: 4/v4)
INFO [main] 2019-03-24 09:25:01,937 IndexSummaryManager.java:85 - Initializing index summary manager with a memory pool size of 297 MB and a resize interval of 60 minutes
INFO [main] 2019-03-24 09:25:02,029 MessagingService.java:753 - Starting Messaging Service on /192.168.0.50:7012 (eth0)
INFO [main] 2019-03-24 09:25:02,270 StorageService.java:707 - Loading persisted ring state
INFO [main] 2019-03-24 09:25:02,273 StorageService.java:825 - Starting up server gossip
INFO [main] 2019-03-24 09:25:02,520 TokenMetadata.java:479 - Updating topology for /192.168.0.50
INFO [main] 2019-03-24 09:25:02,520 TokenMetadata.java:479 - Updating topology for /192.168.0.50

Running contrail-status one more time shows every service as active:

== Contrail control ==
control: active
nodemgr: active
named: active
dns: active

== Contrail config-database ==
nodemgr: active
zookeeper: active
rabbitmq: active
cassandra: active

== Contrail database ==
kafka: active
nodemgr: active
zookeeper: active
cassandra: active

== Contrail analytics ==
snmp-collector: active
query-engine: active
api: active
alarm-gen: active
nodemgr: active
collector: active
topology: active

== Contrail webui ==
web: active
job: active

== Contrail config ==
svc-monitor: active
nodemgr: active
device-manager: active
api: active
schema: active