Apache Cassandra is a distributed database that spreads requests across nodes, which limits the load on any single node and keeps the cluster balanced. As in any distributed system, though, finding the root cause of a problem starts with narrowing down where it occurs and how much of the cluster it affects. In Cassandra, that means identifying the nodes or instances responsible for the problem. Only then can you fix it.
An effective way to locate a problem and determine its scope is to combine metrics data with log analysis, which helps pinpoint the root cause quickly. Cassandra exposes a range of metrics that support incident response. Analyzing them confirms that a problem exists and shows where it is located, whether on a single node or across a data center.
Let’s discuss the metrics that are useful for root cause analysis.
- Metrics related to client requests- These provide insight into failures, timeouts, latency, and throughput for each request type.
- Metrics related to the table- The metrics for assessing the performance of tables include:
- MemtableOnHeapSize- Amount of data residing on the heap in the memtable
- MemtableOffHeapSize- Amount of data residing off-heap in the memtable
- MemtableLiveDataSize- Amount of live data stored in the memtable
- AllMemtablesOnHeapSize- Amount of data residing on the heap across all memtables, including pending flush and 2i memtables
- AllMemtablesLiveDataSize- Amount of live data in memtables including pending flush and 2i memtables
- MemtableColumnsCount- The number of columns in memtable
- MemtableSwitchCount- Number of times memtable has been switched out because of flush
- CompressionRatio- The compression ratio for all SSTables
- ReadLatency- It denotes the local read latency for the table.
- RangeLatency- It denotes the local range scan latency for the table.
- PendingFlushes- It denotes the estimated number of flush tasks pending for the table.
- BytesFlushed- It denotes the number of flushed bytes since server restart
Many other table metrics are available for gauging the performance of tables in the database.
Similarly, keyspace metrics are available for assessing the performance of keyspaces.
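Over JMX, each of these metrics is addressed by an MBean object name in the `org.apache.cassandra.metrics` domain. The sketch below builds those names for table, keyspace, and client request metrics. The helper function names and the `shop`/`orders` keyspace and table are hypothetical, and the name pattern assumes a recent Cassandra release (older versions used a different `type` for table metrics), so check the documentation for your version.

```python
# Sketch: build JMX object names for Cassandra metrics.
# Helper names are illustrative only; they are not part of any Cassandra API.

DOMAIN = "org.apache.cassandra.metrics"


def table_metric_mbean(keyspace: str, table: str, metric: str) -> str:
    """Object name for a per-table metric such as ReadLatency."""
    return f"{DOMAIN}:type=Table,keyspace={keyspace},scope={table},name={metric}"


def keyspace_metric_mbean(keyspace: str, metric: str) -> str:
    """Object name for a metric aggregated over a whole keyspace."""
    return f"{DOMAIN}:type=Keyspace,keyspace={keyspace},name={metric}"


def client_request_mbean(request_type: str, metric: str) -> str:
    """Object name for a client request metric, e.g. Read/Timeouts."""
    return f"{DOMAIN}:type=ClientRequest,scope={request_type},name={metric}"


if __name__ == "__main__":
    # 'shop' and 'orders' are made-up names used only for illustration.
    for name in ("ReadLatency", "PendingFlushes", "BytesFlushed"):
        print(table_metric_mbean("shop", "orders", name))
    print(keyspace_metric_mbean("shop", "ReadLatency"))
    print(client_request_mbean("Read", "Timeouts"))
```

A monitoring integration would pass names like these to whatever JMX client it uses; the strings themselves are the only part shown here.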
Collecting Cassandra metrics is generally done with
- nodetool
- JMX/Metrics integrations
nodetool– This is a command-line tool that runs directly on an operational node. It shows detailed metrics for tables along with server metrics and compaction statistics.
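As an illustration of working with nodetool output, the following sketch parses the text printed by `nodetool status` and lists the nodes that are not in the Up/Normal (`UN`) state. The sample output is invented and simplified, the function names are hypothetical, and the exact column layout can vary between Cassandra versions, so treat the parsing as an assumption to verify against your own cluster.

```python
import re
import subprocess

# Node rows start with a two-letter code: U/D (up/down) followed by
# N/L/J/M (normal/leaving/joining/moving), then the node address.
ROW = re.compile(r"^([UD][NLJM])\s+(\S+)")


def parse_status(output: str) -> list:
    """Return a (datacenter, address, state) tuple for each node row."""
    nodes = []
    datacenter = ""
    for line in output.splitlines():
        if line.startswith("Datacenter:"):
            datacenter = line.split(":", 1)[1].strip()
            continue
        match = ROW.match(line)
        if match:
            nodes.append((datacenter, match.group(2), match.group(1)))
    return nodes


def unhealthy(nodes: list) -> list:
    """Keep only the nodes that are not Up/Normal."""
    return [node for node in nodes if node[2] != "UN"]


def run_nodetool_status() -> str:
    """Run `nodetool status` locally; assumes nodetool is on the PATH."""
    result = subprocess.run(
        ["nodetool", "status"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Invented, simplified sample so the sketch runs without a cluster.
SAMPLE = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns   Host ID  Rack
UN  10.0.0.1   1.2 GiB   16      33.3%  id-1     rack1
DN  10.0.0.2   1.1 GiB   16      33.3%  id-2     rack1
UJ  10.0.0.3   0.4 GiB   16      33.4%  id-3     rack2
"""

if __name__ == "__main__":
    for dc, address, state in unhealthy(parse_status(SAMPLE)):
        print(dc, address, state)
```

On a live node, `parse_status(run_nodetool_status())` would replace the sample text.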
Troubleshooting in Cassandra involves several steps.
- Identify the faulty nodes.
- Each node hosts multiple keyspaces, so find the faulty keyspace within the node.
- Next, look for errors or defects in the source code, either by dry-running the whole code or by isolating the faulty part.
- Changes in properties or versions can introduce errors. To mitigate this, update the source code to match the current properties.
- The Cassandra executables, CQL, and the APIs may also malfunction.
- Once the errors and their sources are found, fix them by debugging, patching the faulty part, or rewriting the code.
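The first step, identifying faulty nodes, can be approximated by comparing one metric across nodes and flagging the outliers. The sketch below is one simple heuristic, not a method prescribed by Cassandra: it marks any node whose latency exceeds a multiple of the cluster median. The function name, the threshold factor of 2.0, and the latency figures are all assumptions made for illustration.

```python
from statistics import median


def slow_nodes(latency_ms: dict, factor: float = 2.0) -> list:
    """Return nodes whose latency exceeds `factor` times the cluster median."""
    if not latency_ms:
        return []
    baseline = median(latency_ms.values())
    return sorted(
        node for node, value in latency_ms.items() if value > factor * baseline
    )


# Hypothetical read latencies in milliseconds, one value per node.
SAMPLE = {
    "10.0.0.1": 4.1,
    "10.0.0.2": 3.8,
    "10.0.0.3": 19.5,
    "10.0.0.4": 4.4,
}

if __name__ == "__main__":
    print(slow_nodes(SAMPLE))
```

Using the median rather than the mean keeps a single very slow node from shifting the baseline it is compared against. The same comparison can be repeated per keyspace or per table to narrow the scope further.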