Apache Cassandra is a distributed system, so it can spread requests across nodes, limiting the load on any single node and balancing load across the cluster. But in any distributed system, finding the root cause of a problem starts with narrowing down where the problem occurs and how much of the cluster it affects. In Apache Cassandra, this means identifying the nodes or instances responsible for the problem. Only then can you fix it.
An effective strategy for finding where the problem is and what its scope is combines metrics data, which gives more focused insight, with log analysis. Together they make the root cause easier to identify. Cassandra exposes a range of metrics that support incident response. By analysing them, you can confirm that a problem exists and pinpoint its location (a node or a data centre).
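As a rough illustration of locating a problem node from metrics, the sketch below flags nodes whose latency is far above the cluster median. The function name, the threshold factor and the sample values are all hypothetical; in practice the figures would come from whatever metrics pipeline is in place.

```python
from statistics import median


def find_outlier_nodes(latency_ms_by_node, factor=2.0):
    """Return nodes whose latency exceeds `factor` times the cluster median."""
    if not latency_ms_by_node:
        return []
    cluster_median = median(latency_ms_by_node.values())
    return sorted(
        node
        for node, latency in latency_ms_by_node.items()
        if latency > factor * cluster_median
    )


# Hypothetical per-node read latencies in milliseconds (not real measurements).
sample = {"10.0.0.1": 4.0, "10.0.0.2": 5.0, "10.0.0.3": 21.0, "10.0.0.4": 4.5}
print(find_outlier_nodes(sample))  # prints ['10.0.0.3']
```

The median is used rather than the mean so that one very slow node does not drag the baseline up and hide itself.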
Let’s discuss the metrics that can be used for root cause analysis:
- Client request metrics - These provide insight into failures, timeouts, request types, latency, throughput and so on
- Table metrics - The metrics for assessing the performance of tables include
  - MemtableOnHeapSize - Amount of data residing on heap in the memtable
  - MemtableOffHeapSize - Amount of data residing off heap in the memtable
  - MemtableLiveDataSize - Amount of live data stored in the memtable
  - AllMemtablesOnHeapSize - Amount of data stored on heap in all memtables, including pending flush and secondary index (2i) memtables
  - AllMemtablesLiveDataSize - Amount of live data in all memtables, including pending flush and secondary index (2i) memtables
  - MemtableColumnsCount - Number of columns in the memtable
  - MemtableSwitchCount - Number of times the memtable has been switched out because of a flush
  - CompressionRatio - Current compression ratio for all SSTables
  - ReadLatency - Local read latency for the table
  - RangeLatency - Local range scan latency for the table
  - PendingFlushes - Estimated number of flush tasks pending for the table
  - BytesFlushed - Number of bytes flushed since the server restarted
Many other table metrics are also used to assess the performance of the tables in the database.
Similarly, there are keyspace metrics for assessing the performance of the keyspaces.
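Table metrics like these are exposed over JMX as MBeans. The sketch below builds the object name for a table metric, assuming the `org.apache.cassandra.metrics:type=Table,...` naming pattern used by recent Cassandra versions (older releases may use a different `type`). The keyspace `shop` and table `orders` are hypothetical.

```python
def table_metric_mbean(keyspace, table, metric):
    """Build the JMX object name for a table metric (assumed naming pattern)."""
    return (
        "org.apache.cassandra.metrics:"
        f"type=Table,keyspace={keyspace},scope={table},name={metric}"
    )


# Hypothetical keyspace and table names, with metrics from the list above.
for metric in ("ReadLatency", "PendingFlushes", "BytesFlushed"):
    print(table_metric_mbean("shop", "orders", metric))
```

The resulting names can be pasted into a JMX client to look up the corresponding values on a node.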
Cassandra metrics are generally collected with
- nodetool
- JMX/metrics integrations
nodetool: This is a command-line tool that runs directly on an operational node. It shows detailed metrics for tables along with server metrics and compaction statistics.
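A minimal sketch of collecting those three kinds of output from a script, assuming `nodetool` is on the PATH of the node where the script runs. It uses the `tablestats`, `info` and `compactionstats` subcommands; the report names and helper functions are illustrative only.

```python
import subprocess

# nodetool subcommands for table metrics, server metrics and compaction statistics.
REPORTS = {
    "tables": ["tablestats"],
    "server": ["info"],
    "compaction": ["compactionstats"],
}


def run_nodetool(args):
    """Run nodetool with the given arguments and return its standard output."""
    result = subprocess.run(
        ["nodetool", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def collect_reports(runner=run_nodetool):
    """Return a dict mapping each report name to the raw command output."""
    return {name: runner(args) for name, args in REPORTS.items()}
```

Calling `collect_reports()` on a node returns the raw text of each report. The `runner` parameter exists so the command execution can be swapped out, for example to run the commands over SSH or to substitute canned output when testing.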
Troubleshooting in Cassandra involves several steps:
- Identify the faulty nodes.
- Each node holds multiple keyspaces, so find the faulty keyspace within the node.
- Next, look for errors or defects in the source code. You can dry-run the whole code or isolate the part that is faulty.
- Changes in properties or versions may be causing the error. To mitigate this, update the code to match the current properties.
- The Cassandra executables, CQL and the APIs may also malfunction.
- Once the errors and their sources have been found, fix them by debugging, correcting the faulty parts or rewriting the code.
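The first step, identifying faulty nodes, can be partly automated by scanning the output of `nodetool status`, where each node row starts with a two-letter code (Up/Down followed by Normal/Leaving/Joining/Moving). The sketch below is one possible approach; the sample text is illustrative rather than captured from a real cluster, and column layout may differ between versions.

```python
import re

# Matches node rows such as "UN  10.0.0.1 ...": a status letter (U/D),
# a state letter (N/L/J/M), then whitespace and the node address.
NODE_ROW = re.compile(r"^([UD])([NLJM])\s+(\S+)")


def unhealthy_nodes(status_text):
    """Return (address, code) pairs for nodes that are not Up and Normal."""
    flagged = []
    for line in status_text.splitlines():
        match = NODE_ROW.match(line.strip())
        if match:
            code = match.group(1) + match.group(2)
            if code != "UN":
                flagged.append((match.group(3), code))
    return flagged


# Illustrative sample only; host IDs and loads are made up.
SAMPLE = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load       Tokens  Owns   Host ID  Rack
UN  10.0.0.1  1.2 GiB    256     33.3%  aaaa     rack1
DN  10.0.0.2  1.1 GiB    256     33.3%  bbbb     rack1
UJ  10.0.0.3  300.5 MiB  256     33.4%  cccc     rack1
"""
print(unhealthy_nodes(SAMPLE))  # prints [('10.0.0.2', 'DN'), ('10.0.0.3', 'UJ')]
```

A node that is up but slow will not show up here, so this check complements the metrics rather than replacing them.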