Friday, 08 January 2021 21:19

HANA – hdbcons first aid hacks

Written by Jens Gleichmann
Rate this item
(0 votes)
“© 2020. SAP SE or an SAP affiliate company. All rights reserved.” “Used with permission of SAP SE”

A tool which got more and more power since HANA 2.0 was released is hdbcons. It should only be used by experienced admins and should be tested in non-prod systems, because it can produce system crashes. There are several known issues documented in notes which describe scenarios you should avoid the usage of hdbcons in prod systems.
This blog won’t describe the full functional scope (see the central hdbcons note 2222218). I will show some general use cases which are not well known, but should be kept in mind for scenarios I call ‘HANA first aid’. At the end of the blog you will also get an example script for such a troubleshooting situation.

  1. General stuff about hdbcons
  2. HANA 2.0 SPS04 vs. SPS05 features
  3. Disk / Container / Log Management
  4. Memory / NUMA
  5. HSR
  6. Troubleshooting


General stuff about hdbcons

At first just use it if you are sure you don’t get this information by using the SQL way over e.g. HANA Cockpit. It remains as an expert tool. Also if you already used it 100 times. Every installation has its own topology and parameter set which can be different. Please try every command in the non-prod system before you use it in production. However, this just minimizes the risk. It doesn’t eliminate it!

When I should use hdbcons?

  • good if you don’t have SQL access or the needed roles
  • secondary site of a HSR
  • system is unresponsible

All 3 aspects have in common that there is no way to connect via SQL. In HSR scenario (besides active/active = read enabled) this is the case for the secondary site.

As already explained I won’t explain the full scope of the tool, but I will show some hacks how I use it:
On Linux level you can start hdbcons as sidadm:

#shows normal help hdbcons "help"

#shows all sub commands hdbcons "all help"

#is a quick search if, you do not know the exact command hdbcons "all help" | grep -i <search term> -B 10

hdbcons [-p <pid>] [-e <process_name>] Per default hdbcons will connect to the indexserver process, but you can specify the process ID of another SAP HANA process via "-p <pid>" or the name of another SAP HANA process via "-e <process_name>" (e.g. <process_name> = 'hdbnameserver').


HANA 2.0 SPS04 vs. SPS05 features

As there are no big new features in SPS05 the version are pretty identically. There is only one difference in context of CPU control regarding the LLC called cpuresctrl.

Here is the section overview of SPS05:
ae_tableload – Handle loading of column store tables and columns
all – Print help and other info for all hdbcons commands
authentication – Authentication management.
binarysemaphore – BinarySemaphore management
bye – Exit console client
cd – ContainerDirectory management
cfgreg – Basis Configurator
checktopic – CheckTopic management
cnd – ContainerNameDirectory management
conditionalvariable – ConditionalVariable management
connection – Connection management
context – Execution context management (i.e., threads)
cpuresctrl – Manage cpu resources such as last-level cache allocation
converter – Converter management
crash – Crash management
crypto – Cryptography management (SSL/SAML/X509/Encryption).
csaccessor – Display diagnostics related to the CSAccessor library
ddlcontextstore – Get DdlContextStore information
deadlockdetector – Deadlock detector.
debug – Debug management
distribute – Handling distributed systems
dvol – DataVolume management
ELF – ELF symbol resolution management
encryption – Persistence encryption management
eslog – Manipulate logger on extended storage
event – Event management
exit – Exit console client
flightrecorder – Flight Recorder
hananet – HANA-Net command interface
help – Display help for a command or command list
hkt – HANA Kernal Tracer (HKT) management
indexmanager – Get IndexManager information, especially for IndexHandles
itab – Internaltable diagnostics
jexec – Information and actions for Job Executor/Scheduler
licensing – Licensing management.
log – Show information about logger and manipulate logger
machine – Information about the machine topology
mm – Memory management
monitor – Monitor view command
mproxy – Malloc proxy management
msl – Mid size LOB management
mutex – Mutex management
numa – Provides NUMA statistics for all columns of a given table, broken down by column constituents like dictionary, data vector and index.
nvmprovider – NVM Provider
output – Command for managing output from the hdbcons
page – Page management
pageaccess – PageAccess management
profiler – Profiler
quit – Exit console client
readwritelock – ReadWriteLock management
replication – Monitor data and log replication
resman – ResourceManager management
rowstore – Row Store
runtimedump – Generate a runtime dump.
savepoint – Savepoint management
semaphore – Semaphore management
servicethreads – Thread information M_SERVICE_THREADS
snapshot – Snapshot management
stat – Statistics management
statisticsservercontroller – StatisticsServer internals
statreg – Statistics registry command
syncprimi – Syncprimitive management (Mutex, CondVariable, Semaphore, BinarySemaphore,
ReadWriteLock)
table – Table Management
tablepreload – Manage and monitor table preload
trace – Trace management
tracetopic – TraceTopic management
transaction – Transaction management
ut – UnifiedTable Management
version – Version management
vf – VirtualFile management
x2 – get X2 info


Disk / Container / Log Management

One of the most ask questions is about the reorg of the persistence layer. Here we have an easy command to get the information we need:

hdbcons "dvol info" ## Start command at: 2021-01-09 13:08:29.730 DataVolumePartition #0 (/hana/data/D02/mnt00001/hdb00004.00004/) size= 2593019396096 used= 2424372674560 NOTE: We DO NOT recommend reclaiming this DataVolume. Reclaiming may take considerable time and free up space that will have to be reallocated in subsequent savepoint cycles anyway. 

Here is also the recommendation if you should reorg/reclaim or not.

If we need more details how the data volume divided, we need information about the structure. The persistence is structured in container. We have several container types like “Unified Table”, “VirtualFileLOB” or “MidSizeContainerLOB”. Too many LOB files can be a performance overhead for the startup procedure, the size of the PageCache (Memory overhead) or issues regarding the garbage collection. This means you should always be aware of the size.

#global statistics to get an overview of non-temp and temp containers hdbcons "cd gstat" ## Start command at: 2021-01-08 14:54:52.570 Sum of non-temporary containers: FreeUndoFile : 7862 VirtualFileLOB : 49879170 VirtualFileCLOB : 25 VirtualFileNCLOB : 377153 VirtualFileMain : 287842 VirtualFileMetaData : 3 liveCache : 3 UnifiedTable : 89115 MidSizeLobContainerLOB : 22 MidSizeLobContainerCLOB : 1 MidSizeLobContainerNCLOB: 27 FixedSizeEntry : 1 Sum of temporary containers: raw size of ContainerDirectory : 92 MB, page count: 370 raw size of LOB ContainerDirectory : 19 GB, page count: 79951 raw size of MS LOBContainerDirectory: 256 kB, page count: 1 raw size of UNDO ContainerDirectory : 313 MB, page count: 1252 raw size of temp ContainerDirectory : 0 B, page count: 0 raw size of FileIDMap container : 4 GB, page count: 17885 raw size of LOB FileIDMap container : 96 MB, page count: 384

#query all containers hdbcons "cd stat -a" ## Start command at: 2021-01-09 14:02:10.059 UnifiedTable : count= 89115, sum page count= 1156561, sum raw size= 1533 GB, deleted count= 0, sum deleted page count= 0, sum deleted raw size= 0 B VirtualFileLOB : count= 49879170, sum page count= 138155123, sum raw size= 593 GB, deleted count= 0, sum deleted page count= 0, sum deleted raw size= 0 B MidSizeLobContainerLOB : count= 22, sum page count= 150284, sum raw size= 9 GB, deleted count= 0, sum deleted page count= 0, sum deleted raw size= 0 B VirtualFileNCLOB : count= 378285, sum page count= 553743, sum raw size= 4 GB, deleted count= 0, sum deleted page count= 0, sum deleted raw size= 0 B VirtualFileMain : count= 287842, sum page count= 309886, sum raw size= 3 GB, deleted count= 0, sum deleted page count= 0, sum deleted raw size= 0 B MidSizeLobContainerNCLOB: count= 27, sum page count= 13859, sum raw size= 866 MB, deleted count= 0, sum deleted page count= 0, sum deleted raw size= 0 B FreeUndoFile : count= 7954, sum page count= 7952, sum raw size= 108 MB, deleted count= 0, sum deleted page count= 0, sum deleted raw size= 0 B VirtualFileMetaData : count= 3, sum page count= 17, sum raw size= 4 MB, deleted count= 0, sum deleted page count= 0, sum deleted raw size= 0 B liveCache : count= 3, sum page count= 2, sum raw size= 512 kB, deleted count= 0, sum deleted page count= 0, sum deleted raw size= 0 B VirtualFileCLOB : count= 26, sum page count= 52, sum raw size= 208 kB, deleted count= 0, sum deleted page count= 0, sum deleted raw size= 0 B MidSizeLobContainerCLOB : count= 1, sum page count= 1, sum raw size= 64 kB, deleted count= 0, sum deleted page count= 0, sum deleted raw size= 0 B FixedSizeEntry : count= 1, sum page count= 0, sum raw size= 0 B, deleted count= 0, sum deleted page count= 0, sum deleted raw size= 0 B time needed to collect data from container directories: 4.78 min [OK] ## Finish command at: 2021-01-09 14:06:57.154 command took: 4.785 min 

#query a specific type hdbcons "cd count -a -t <type>" hdbcons "cd count -a -t VirtualFileLOB​" ## Start command at: 2021-01-09 13:24:13.367 LOBContainerDirectory : count= 49879170, sum page count= 138155123, sum raw size= 593 GB, deleted count= 0, sum deleted page count= 0, sum deleted raw size= 0 B 

#find the top <n> objects hdbcons "cd top -l <n>" hdbcons "cd top -l 10" ## Start command at: 2021-01-09 13:45:55.874 Top Raw Size: 1: size: 47 GB, cID: 0xfe03235f24, I\1\SAPREQ\1\EDID4\1\\1\D\1\ || $container$ 2: size: 42 GB, cID: 0xfe03235f11, I\1\SAPREQ\1\CDPOS\1\\1\D\1\ || $container$ 3: size: 41 GB, cID: 0xfe03235f18, I\1\SAPREQ\1\VBFA\1\\1\D\1\ || $container$ 4: size: 39 GB, cID: 0xfe03235f13, I\1\SAPREQ\1\VRPMA\1\\1\D\1\ || $container$ 5: size: 37 GB, cID: 0xfe03235f15, I\1\SAPREQ\1\BALDAT\1\\1\D\1\ || $container$ 6: size: 36 GB, cID: 0xfe03235f14, I\1\SAPREQ\1\VBRP\1\\1\D\1\ || $container$ 7: size: 32 GB, cID: 0xfe03235f0d, I\1\SAPREQ\1\CE40010_ACCT\1\\1\D\1\ || $container$ 8: size: 30 GB, cID: 0xfe03235f0e, I\1\SAPREQ\1\BSIM\1\\1\D\1\ || $container$ 9: size: 26 GB, cID: 0xfe7cd37a90, I\1\SAPREQ\1\_SYS_SPLIT_BSEG~7\1\\1\D\1\ || $container$ 10: size: 26 GB, cID: 0xfe7cd37a91, I\1\SAPREQ\1\_SYS_SPLIT_BSEG~8\1\\1\D\1\ || $container$ 

With this commands you will get all details you need for a detailed analysis

From data volume you will go further to the log volume.
There are min. 2 log segments. One in state free and one in state writing. There can be more till the log volume is full.

#show state of each log segments hdbcons "log info" | grep STATE

#to show / count all log segments with state free hdbcons "log info" | grep STATE | grep Free | wc -l

#to show / count all log segments NOT in state free hdbcons "log info" | grep STATE | grep -v Free | wc -l

More details about the log segments can be found in other blogs/notes like:
Details about HANA overwrite log mode
2083715 – Analyzing log volume full situations
1679938 – Disk Full Event on Log Volume


Memory / NUMA

One of the important function is to get an overview over all tenants. The easiest way – in my opinion – is the inter process memory management (IPMM). This is also an important part of every RTE dump. IPMM includes PAL, TPAL, AB and U. What do these acronyms mean?

#create IPMM report hdbcons "mm ipmm" Inter Process Memory Management IPMM shared memory segment: SAP_IPMM_O048_D02 SId= Process secondary identifier AB= Allocated bytes by allocators for heap and stacks (inclusive allocator cache) UA= Unaccounted memory (by unlimited allocators for exceptions etc.), maybe overcommit U= Used bytes for heap and stacks (without allocator cache) FSL= Free segment list size FMC= Free precharge memory cache size SMAS/SMDS= System memory traffic information (accumulation of allocations/deallocations) PAL= Process allocation limit TPAL= Temp process allocation limit Blocked= Memory (TPAL!) blocked/reserved for this process W= Weight of process [0] PID=250869 (state=S), SId=640483232, compactors active, alive, process name: hdbnameserver AB=166.63gb (178927042560b), UA=0b, U=7.08gb (7612028948b), FSL=6gb (6442450944b), FMC=0b, SMAS=1.52tb (1678410842112b), SMDS=1.36tb (1499367014400b), PAL=7.74tb (8516251680768b), TPAL=168.45gb (180875427840b), W=100 [1] PID=251734 (state=S), SId=640483233, compactors active, alive, process name: hdbcompileserver AB=4.57gb (4913692672b), UA=0b, U=677.12mb (710020245b), FSL=0b, FMC=0b, SMAS=214.09gb (229882134528b), SMDS=209.38gb (224827015168b), PAL=7.74tb (8516251680768b), TPAL=4.57gb (4913692672b), Blocked=1.02gb (1102446592b) (perm=1102446592B, temp=0), W=100 [2] PID=251737 (state=S), SId=640483234, compactors active, alive, process name: hdbpreprocessor AB=3.45gb (3715432448b), UA=0b, U=670.25mb (702814346b), FSL=0b, FMC=0b, SMAS=185.04gb (198688833536b), SMDS=181.46gb (194850193408b), PAL=7.74tb (8516251680768b), TPAL=3.45gb (3715432448b), W=100 [3] PID=280438 (state=S), SId=640483247, compactors active, alive, process name: hdbxsengine AB=25.36gb (27232108544b), UA=0b, U=1.96gb (2105583430b), FSL=0b, FMC=0b, SMAS=304.03gb (326460309504b), SMDS=278.56gb (299110891520b), PAL=7.74tb (8516251680768b), TPAL=25.60gb (27491827712b), W=100 [4] PID=279916 (state=R), SId=640483246, compactors active, alive, process name: hdbindexserver AB=2.14tb (2361270599680b), UA=0b, U=1005.04gb (1079160726009b), FSL=198.25gb (212869316608b), FMC=0b, SMAS=9.21tb (10130337824768b), SMDS=7.06tb (7768950571008b), PAL=7.74tb (8516251680768b), TPAL=2.17tb (2394008911872b), W=100 [5] PID=266322 (state=S), SId=640483237, compactors active, alive, process name: hdbwebdispatcher AB=4.48gb (4814077952b), UA=0b, U=934.67mb (980079007b), FSL=0b, FMC=0b, SMAS=232.21gb (249344753664b), SMDS=227.61gb (244402880512b), PAL=7.74tb (8516251680768b), TPAL=4.48gb (4814077952b), W=100 [6] PID=266887 (state=S), SId=640483238, compactors active, alive, process name: hdbindexserver AB=4.52tb (4977854775296b), UA=0b, U=3.24tb (3569697000331b), FSL=167.25gb (179583320064b), FMC=0b, SMAS=119.61tb (131515736981504b), SMDS=115.08tb (126537765879808b), PAL=7.74tb (8516251680768b), TPAL=4.63tb (5092921180160b), W=100 [7] PID=266890 (state=S), SId=640483239, compactors active, alive, process name: hdbxsengine AB=29.97gb (32184336384b), UA=0b, U=2.08gb (2239677798b), FSL=0b, FMC=0b, SMAS=355.90gb (382152015872b), SMDS=325.82gb (349850238976b), PAL=7.74tb (8516251680768b), TPAL=30.01gb (32224116736b), W=100 AB=64.69gb (69462654976b), UA=0b, U=800.17mb (839044495b), FSL=5.75gb (6174015488b), FMC=2.50gb (2684420096b), SMAS=211.13gb (226702262272b), SMDS=146.30gb (157099687936b), PAL=7.74tb (8516251680768b), TPAL=64.74gb (69514493952b), W=100 GLOBAL_MAX_ALLOCATION_LIMIT=7.74tb (8516251680768b) SHARED_MEMORY=139.59gb (149889613824b) OOM_RESERVATION_ALLOCATORS=96.18mb (100859904b) CODE_SIZE=2.44gb (2623602688b) IPMM_RESERVATION=0b Sum TPAL (including shm, code size and IPMM reservation but not IPMM OOM reservation)=7.74tb (8516117463040b) (cached sum=7.74tb (8516117463040b)), emergency reservation=128mb (134217728b) #processes=18, weight sum=1800, cached weight sum=1800, #checks for cached allocation limit sum=1457242, #terminated processes since last reset=9 Early exits are currently allowed. Last reclaim memory started to wait for IPMM lock since 2021-01-09 15:49:28.852. Last reclaim memory got IPMM lock at 2021-01-09 15:49:28.852. User ids respected in shm calculation: [17286] Parallel memory reclaim tasks are activated. This is an opt build. Peak used memory: 5.50tb (6057423072499b) with corresponding values (Heap=5.37tb (5906923121907b), Shared=137.71gb (147876347904b), Code=2.44gb (2623602688b)). Peak heap memory: 5.37tb (5906923121907b). Peak shared memory: 221.46gb (237802225664b). Peak code size: 2.44gb (2623602688b)

Important is that the sum of all PAL are lower than the GAL. With SAP HANA 2 SPS03, the free memory cache (FMC) was introduced in SAP HANA. This feature enables HANA to efficiently reuse the already allocated memory.

Some rules:
AB > U
AB >= TPAL
PAL > AB

PAL = <service>.ini -> [memorymanager] -> allocationlimit
GAL = global.ini -> [memorymanager] -> global_allocation_limit

You want some details about the NUMA topology? No problem:

#Virtual Memory allocations by HANA Memory Manager hdbcons "mm numa -v" ## Start command at: 2021-01-09 15:46:34.490 ************************************************************************************ Virtual Memory allocations by HANA Memory Manager ************************************************************************************ Total number of NUMA Nodes = 8 Target NODE #Allocs #Deallocs Alloc size (MB) Fallback size (MB) Dealloc size (MB) ------------ ------- --------- --------------- ------------------ ----------------- 0 1243 579 929401.88 0.00 696546.06 1 1569 723 878001.88 0.00 611223.69 2 1804 671 981302.19 0.00 604844.62 3 1980 827 1152106.75 0.00 766656.25 4 1362 881 894711.38 0.00 715093.94 5 2645 1541 2286683.88 0.00 1855888.88 6 1814 1268 1080738.38 0.00 889302.94 7 1716 920 1344334.12 0.00 1008572.25 INTERLEAVED 0 0 0.00 -- 0.00 NOTE: Fallback schemes are used if 'mbind' to target node fails! ​

#Physical Memory allocations by OS (resident pages) hdbcons "mm numa -p" ## Start command at: 2021-01-09 15:47:26.270 ******************************************************************************** Physical Memory allocations by OS (resident pages) ******************************************************************************** Total number of NUMA Nodes = 8 Target NODE Local allocation (MB) Remote allocation (MB) NUMA HIT Rate(%) ----------- --------------------- ---------------------- ---------------- 0 131055.12 75951.12 63.31 1 138521.44 100013.12 58.07 2 193621.62 149436.81 56.44 3 268244.00 77401.75 77.61 4 89977.88 69812.50 56.31 5 320315.00 80008.69 80.01 6 121304.75 69729.06 63.50 7 199923.38 134092.31 59.85 INTERLEAVED 0.00 -- -- ********************************************************************************

My system is a test system with remote allocation. In a production environment this musn’t be the case!


HSR

On the secondary site are some details which you can only see there. If you monitor the HSR on the primary site with the python script systemReplicationStatus.py or via HANA Cockpit you only see the data shipping. If this is complete the HSR is running with log shipping. This is correct, but are the systems at this point in time in sync? No, there is a backlog and some steps missing which can be monitored via hdbcons.

#show replay backlog section hdbcons "replication info" | grep replayBacklog​ - replayBacklog : 0 microseconds - replayBacklogSize : 0 bytes - replayBacklogMax : 13065250401 microseconds - replayBacklogSizeMax : 46759845888 bytes 

The backlog should be close to 0 (depending on the sync mode). Only then is it guaranteed that the systems are in the sync!

#Details about the replay steps hdbcons "replication info" | grep replayStep - replayStepStatus : ReplayStep_None - replayStepFinishLogPos : 0x5254392ff80 - replayStepFinishLogPosTimestamp : 09.01.2021-16.59.59 (1610211599646137) - replayStepStartLogPos : 0x5254392ff80 - replayStepStartLogPosTimestamp : 09.01.2021-16.59.59 (1610211599646137) - replayStepCount : 338611163 

ReplayStep_None means currently the systems are in sync and no step is open

#Full backup size hdbcons "replication info" | grep FullBackupSize - shippedFullBackupSize : 5829827788800 bytes - shippedFullBackupSizeNet : 5829827788800 bytes (100.00%) - shippedLastFullBackupSize : 5829827788800 bytes - shippedLastFullBackupSizeNet : 5829827788800 bytes (100.00%) 


Troubleshooting

The first step when you forcing an issue, create a runtime dump. Here you can use the HANA Dump Analyzer

#Generate call stacks for active threads with sql hdbcons "context l -s -q" | grep -v suppressed

#wait analysis to indentify the activity responsible for the locking indexmanager waitgraph -w

#Job execution log jexec logstart -c <events_in_million> jexec logstop jexec logprint <file_name>.jexlog

If you want create your own HANA first aid script, just use the hdbcons commands. Here an example (which is not adapted for a multi tenant system):

 #!/bin/sh c_date=$(date +%F-%H%M%S) function runtimedump { echo "Create RuntimeDump" hdbcons "runtimedump dump" } function job_execs { echo "Create Job Execution Log" hdbcons "jexec logstart -c 30" sleep 10 hdbcons "jexec logstop" hdbcons "jexec logprint "/tmp/${c_date}.jexlog"" } function waitgraph { echo "Create Waitgraph" hdbcons "indexmanager waitgraph -w -o /tmp/waitgraph_${c_date}.dot" dot -Tpdf -o /tmp/waitgraph_${c_date}.pdf /tmp/waitgraph_${c_date}.dot } function active_threads { echo "Create Active Threads Log" hdbcons "context l -s -q" | grep -v suppressed > /tmp/active_threads_${c_date}.log } runtimedump job_execs waitgraph active_threads runtimedump echo "####################################" echo "created files:" echo "/tmp/${c_date}.jexlog" echo "/tmp/waitgraph_${c_date}.pdf" echo "/tmp/active_threads_${c_date}.log" echo "####################################"

With this details the SAP support or every other HANA expert can create a first analysis for the situation. The experts normally need the callstacks, active threads, memory details and more of the point in time situation of the issue. With such an first aid script you can create this information by executing one script. This saves a lot of time and is uniform. Every admin knows it and unexpected system behaviour and you do not know where to start => easy, execute the first aid script. At XLC we have developed a comprehensive HDB first aid tool that saves our customers a lot of time and trouble.

You can read out some views of the passive HSR site via SQL over the primary. If you don’t have SQL access you can read out some (not all!)

#list all available monitor views hdbcons "monitor list"

#search for a specific term hdbcons "monitor list" | grep <search_term>

#print content of monitoring views monitor print -h -n <monitor_view_name>


At the end you should know some hacks which allow you to troubleshoot your HANA system via hdbcons without any SQL access. With this knowledge you can create your own HANA first aid script to collect important information during time of the issue.

Read 58 times

Leave a comment

Make sure you enter all the required information, indicated by an asterisk (*). HTML code is not allowed.