
When will my rebalance complete


This has to be one of the top ASM questions people ask me. But if you expect me to respond with a number of minutes, you will be disappointed. After all, ASM has given you an estimate, and you still want to know exactly when that rebalance is going to finish. Instead, I will show you how to check if the rebalance is actually progressing, what phase it is in, and if there is a reason for concern.

Understanding the rebalance

As explained in the rebalancing act, the rebalance operation has three phases - planning, extents relocation and compacting. As far as the overall time to complete is concerned, the planning phase time is insignificant, so there is no need to worry about it. The extents relocation phase will take most of the time, so that will be the main focus. I will also show what is going on during the compacting phase.

It is important to know why the rebalance is running. If you are adding a new disk, say to increase the available disk group space, it doesn't really matter how long the rebalance takes. OK, maybe it does, if your database is hung because you ran out of space in your archive log destination. Similarly, if you are resizing or dropping disk(s) to adjust the disk group space, you are generally not concerned with the time it takes for the rebalance to complete.

But if a disk has failed and ASM has initiated the rebalance, there may be a legitimate reason for concern. If your disk group is normal redundancy AND another disk fails AND it is the partner of the disk that has already failed, your disk group will be dismounted, all databases that use that disk group will crash and you may lose data. In such cases I understand that you want to know when that rebalance will complete. Actually, you want to see the relocation phase completed, as once it is done, all your data is fully redundant again.

Extents relocation

To have a closer look at the extents relocation phase, I drop one of the disks with the default rebalance power:

SQL> show parameter power

NAME                                 TYPE        VALUE
------------------------------------ ----------- ----------------
asm_power_limit                      integer     1

SQL> set time on
16:40:57 SQL> alter diskgroup DATA1 drop disk DATA1_CD_06_CELL06;

Diskgroup altered.

Initial estimated time to complete is 26 minutes:

16:41:21 SQL> select INST_ID, OPERATION, STATE, POWER, SOFAR, EST_WORK, EST_RATE, EST_MINUTES from GV$ASM_OPERATION where GROUP_NUMBER=1;

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         3 REBAL WAIT          1
         2 REBAL RUN           1        516      53736       2012          26
         4 REBAL WAIT          1

About 10 minutes into the rebalance, the estimate is 24 minutes:

16:50:25 SQL> /

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         3 REBAL WAIT          1
         2 REBAL RUN           1      19235      72210       2124          24
         4 REBAL WAIT          1

While that EST_MINUTES doesn't give me much confidence, I see that SOFAR (the number of allocation units moved so far) is going up, which is a good sign.
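
If you want a single progress figure, the same view can provide one; a minimal sketch, derived only from the SOFAR and EST_WORK columns shown above (keep in mind that EST_WORK itself tends to grow as the rebalance progresses):

SQL> select INST_ID, round(SOFAR/EST_WORK*100, 1) "Done (%)"
from GV$ASM_OPERATION
where GROUP_NUMBER=1 and STATE='RUN';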

The ASM alert log shows the time of the drop disk, the OS process ID of the ARB0 process doing all the work, and most importantly - that there are no errors:

Wed Jul 11 16:41:15 2012
SQL> alter diskgroup DATA1 drop disk DATA1_CD_06_CELL06
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=1
...
NOTE: starting rebalance of group 1/0x6ecaf3e6 (DATA1) at power 1
Starting background process ARB0
Wed Jul 11 16:41:24 2012
ARB0 started with pid=41, OS id=58591
NOTE: assigning ARB0 to group 1/0x6ecaf3e6 (DATA1) with 1 parallel I/O
NOTE: F1X0 copy 3 relocating from 0:2 to 55:35379 for diskgroup 1 (DATA1)
...

The ARB0 trace file should show which file extents are being relocated. It does, and that is how I know that ARB0 is doing what it is supposed to do:

$ tail -f /u01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_arb0_58591.trc
...
ARB0 relocating file +DATA1.282.788356359 (120 entries)
*** 2012-07-11 16:48:44.808
ARB0 relocating file +DATA1.283.788356383 (120 entries)
...
*** 2012-07-11 17:13:11.761
ARB0 relocating file +DATA1.316.788357201 (120 entries)
*** 2012-07-11 17:13:16.326
ARB0 relocating file +DATA1.316.788357201 (120 entries)
...

Note that there may be a lot of ARB0 trace files in the trace directory, which is why we need to know the OS process ID of the ARB0 actually doing the rebalance. That information is in the alert log of the ASM instance performing the rebalance.
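
One way to pull out that OS process ID is to grep the ASM alert log for the ARB0 startup message; a quick sketch, assuming the default alert log location for the +ASM2 instance used in this example:

$ grep "ARB0 started with pid" /u01/app/oracle/diag/asm/+asm/+ASM2/trace/alert_+ASM2.log | tail -1
ARB0 started with pid=41, OS id=58591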

I can also look at the pstack of the ARB0 process to see what is going on. It shows that ASM is relocating extents (the key functions on the stack being kfgbRebalExecute - kfdaExecute - kffRelocate):

# pstack 58591
#0  0x0000003957ccb6ef in poll () from /lib64/libc.so.6
...
#9  0x0000000003d711e0 in kfk_reap_oss_async_io ()
#10 0x0000000003d70c17 in kfk_reap_ios_from_subsys ()
#11 0x0000000000aea50e in kfk_reap_ios ()
#12 0x0000000003d702ae in kfk_io1 ()
#13 0x0000000003d6fe54 in kfkRequest ()
#14 0x0000000003d76540 in kfk_transitIO ()
#15 0x0000000003cd482b in kffRelocateWait ()
#16 0x0000000003cfa190 in kffRelocate ()
#17 0x0000000003c7ba16 in kfdaExecute ()
#18 0x0000000003d4beaa in kfgbRebalExecute ()
#19 0x0000000003d39627 in kfgbDriver ()
#20 0x00000000020e8d23 in ksbabs ()
#21 0x0000000003d4faae in kfgbRun ()
#22 0x00000000020ed95d in ksbrdp ()
#23 0x0000000002322343 in opirip ()
#24 0x0000000001618571 in opidrv ()
#25 0x0000000001c13be7 in sou2o ()
#26 0x000000000083ceba in opimai_real ()
#27 0x0000000001c19b58 in ssthrdmain ()
#28 0x000000000083cda1 in main ()

After about 35 minutes the EST_MINUTES drops to 0:

17:16:54 SQL> /

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         2 REBAL RUN           1      74581      75825       2129           0
         3 REBAL WAIT          1
         4 REBAL WAIT          1

And soon after that, the ASM alert log shows:
  • Disk emptied
  • Disk header erased
  • PST update completed successfully
  • Disk closed
  • Rebalance completed
Wed Jul 11 17:17:32 2012
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=1
Wed Jul 11 17:17:41 2012
GMON updating for reconfiguration, group 1 at 20 for pid 38, osid 93832
NOTE: group 1 PST updated.
SUCCESS: grp 1 disk DATA1_CD_06_CELL06 emptied
NOTE: erasing header on grp 1 disk DATA1_CD_06_CELL06
NOTE: process _x000_+asm2 (93832) initiating offline of disk 0.3916039210 (DATA1_CD_06_CELL06) with mask 0x7e in group 1
NOTE: initiating PST update: grp = 1, dsk = 0/0xe96a042a, mask = 0x6a, op = clear
GMON updating disk modes for group 1 at 21 for pid 38, osid 93832
NOTE: PST update grp = 1 completed successfully
NOTE: initiating PST update: grp = 1, dsk = 0/0xe96a042a, mask = 0x7e, op = clear
GMON updating disk modes for group 1 at 22 for pid 38, osid 93832
NOTE: cache closing disk 0 of grp 1: DATA1_CD_06_CELL06
NOTE: PST update grp = 1 completed successfully
GMON updating for reconfiguration, group 1 at 23 for pid 38, osid 93832
NOTE: cache closing disk 0 of grp 1: (not open) DATA1_CD_06_CELL06
NOTE: group 1 PST updated.
Wed Jul 11 17:17:41 2012
NOTE: membership refresh pending for group 1/0x6ecaf3e6 (DATA1)
GMON querying group 1 at 24 for pid 19, osid 38421
GMON querying group 1 at 25 for pid 19, osid 38421
NOTE: Disk  in mode 0x8 marked for de-assignment
SUCCESS: refreshed membership for 1/0x6ecaf3e6 (DATA1)
NOTE: stopping process ARB0
SUCCESS: rebalance completed for group 1/0x6ecaf3e6 (DATA1)
NOTE: Attempting voting file refresh on diskgroup DATA1

So the estimated time was 26 minutes, and the rebalance actually took about 36 minutes (in this particular case the compacting took less than a minute, so I have ignored it). That is why it is more important to understand what is going on than to know when the rebalance will complete.

Note that the estimated time may also increase. If the system is under heavy load, the rebalance will take more time - especially with rebalance power 1. For a large disk group (many TB) and a large number of files, the rebalance can take hours and possibly days.

If you want to get an idea of how long a disk drop will take in your environment, you need to test it. Just drop one of the disks while your system is under normal/typical load. Your data is fully redundant during such a disk drop, so you are not exposed to a disk group dismount in case its partner disk fails during the rebalance.

Compacting

In another example, to look at the compacting phase, I add the same disk back, with rebalance power 10:

17:26:48 SQL> alter diskgroup DATA1 add disk '/o/*/DATA1_CD_06_cell06' rebalance power 10;

Diskgroup altered.

Initial estimated time to complete is 6 minutes:

17:27:22 SQL> select INST_ID, OPERATION, STATE, POWER, SOFAR, EST_WORK, EST_RATE, EST_MINUTES from GV$ASM_OPERATION where GROUP_NUMBER=1;

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         2 REBAL RUN          10        489      53851       7920           6
         3 REBAL WAIT         10
         4 REBAL WAIT         10

After about 10 minutes, the EST_MINUTES drops to 0:

17:39:05 SQL> /

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         3 REBAL WAIT         10
         2 REBAL RUN          10      92407      97874       8716           0
         4 REBAL WAIT         10

And I see the following in the ASM alert log:

Wed Jul 11 17:39:49 2012
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=1
Wed Jul 11 17:39:58 2012
GMON updating for reconfiguration, group 1 at 31 for pid 43, osid 115117
NOTE: group 1 PST updated.
Wed Jul 11 17:39:58 2012
NOTE: membership refresh pending for group 1/0x6ecaf3e6 (DATA1)
GMON querying group 1 at 32 for pid 19, osid 38421
SUCCESS: refreshed membership for 1/0x6ecaf3e6 (DATA1)
NOTE: Attempting voting file refresh on diskgroup DATA1

That means ASM has completed the second phase of the rebalance and is compacting now. If that is true, the pstack should show kfdCompact() function. Indeed it does:

# pstack 103326
#0  0x0000003957ccb6ef in poll () from /lib64/libc.so.6
...
#9  0x0000000003d711e0 in kfk_reap_oss_async_io ()
#10 0x0000000003d70c17 in kfk_reap_ios_from_subsys ()
#11 0x0000000000aea50e in kfk_reap_ios ()
#12 0x0000000003d702ae in kfk_io1 ()
#13 0x0000000003d6fe54 in kfkRequest ()
#14 0x0000000003d76540 in kfk_transitIO ()
#15 0x0000000003cd482b in kffRelocateWait ()
#16 0x0000000003cfa190 in kffRelocate ()
#17 0x0000000003c7ba16 in kfdaExecute ()
#18 0x0000000003c4b737 in kfdCompact ()
#19 0x0000000003c4c6d0 in kfdExecute ()
#20 0x0000000003d4bf0e in kfgbRebalExecute ()
#21 0x0000000003d39627 in kfgbDriver ()
#22 0x00000000020e8d23 in ksbabs ()
#23 0x0000000003d4faae in kfgbRun ()
#24 0x00000000020ed95d in ksbrdp ()
#25 0x0000000002322343 in opirip ()
#26 0x0000000001618571 in opidrv ()
#27 0x0000000001c13be7 in sou2o ()
#28 0x000000000083ceba in opimai_real ()
#29 0x0000000001c19b58 in ssthrdmain ()
#30 0x000000000083cda1 in main ()

The tail of the ARB0 trace file now shows it relocating just 1 entry at a time (another sign of compacting):

$ tail -f /u01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_arb0_103326.trc
ARB0 relocating file +DATA1.321.788357323 (1 entries)
ARB0 relocating file +DATA1.321.788357323 (1 entries)
ARB0 relocating file +DATA1.321.788357323 (1 entries)
...

The V$ASM_OPERATION view keeps showing EST_MINUTES=0 (compacting):

17:42:39 SQL> /

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         3 REBAL WAIT         10
         4 REBAL WAIT         10
         2 REBAL RUN          10      98271      98305       7919           0

The X$KFGMG view shows REBALST_KFGMG=2 (compacting):

17:42:50 SQL> select NUMBER_KFGMG, OP_KFGMG, ACTUAL_KFGMG, REBALST_KFGMG from X$KFGMG;

NUMBER_KFGMG   OP_KFGMG ACTUAL_KFGMG REBALST_KFGMG
------------ ---------- ------------ -------------
           1          1           10             2

Once the compacting phase completes, the alert log shows "stopping process ARB0" and "rebalance completed":

Wed Jul 11 17:43:48 2012
NOTE: stopping process ARB0
SUCCESS: rebalance completed for group 1/0x6ecaf3e6 (DATA1)

In this case, the extents relocation took about 12 minutes and the compacting took about 4 minutes.

The compacting phase can actually take a significant amount of time. In one case I have seen the extents relocation run for 60 minutes and the compacting after that take another 30 minutes. But it doesn't really matter how long the compacting takes, because as soon as the second phase of the rebalance (extents relocation) completes, all data is fully redundant and we are not exposed to a disk group dismount due to a partner disk failure.

Changing the power

Rebalance power can be changed dynamically, i.e. during the rebalance, so if your rebalance with the default power is 'too slow', you can increase it. How much? Well, do you understand your I/O load, your I/O throughput and, most importantly, your limits? If not, increase the power to 5 (just run 'ALTER DISKGROUP ... REBALANCE POWER 5;') and see if it makes a difference. Give it 10-15 minutes before you jump to conclusions. Should you go higher? Again, as long as you are not adversely impacting your database I/O performance, you can keep increasing the power. But I haven't seen much improvement beyond power 30.
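
For example, to change the power of an ongoing rebalance of disk group DATA1 (the one used earlier in this post):

SQL> alter diskgroup DATA1 rebalance power 5;

Diskgroup altered.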

Testing is the key here. You really need to test this under your regular load and in your production environment. There is no point testing with no databases running, or on a system that runs off a different storage system.


Where is my data

Sometimes we want to know exactly where a particular database block is - on which ASM disk, in which allocation unit on that disk, and in which block of that allocation unit. In this post I will show how to work that out.

Database instance

In the first part of this exercise I am logged into the database instance. Let's create a tablespace first.

SQL> create tablespace T1 datafile '+DATA';
Tablespace created.

SQL> select f.FILE#, f.NAME "File name", t.NAME "Tablespace name"
from V$DATAFILE f, V$TABLESPACE t
where t.NAME='T1' and f.TS# = t.TS#;

     FILE# File name                                Tablespace name
---------- ---------------------------------------- ----------------
         6 +DATA/br/datafile/t1.272.797809075       T1

SQL>

Note that the ASM file number is 272. Let's now create a table and insert some data into it.

SQL> create table TAB1 (n number, name varchar2(16)) tablespace T1;
Table created.

SQL> insert into TAB1 values (1, 'CAT');
1 row created.

SQL> commit;
Commit complete.

Now find out the block number where that data is and check the block size.

SQL> select ROWID, NAME from TAB1;

ROWID              NAME
------------------ ----------------
AAASxxAAGAAAACHAAA CAT

SQL> select DBMS_ROWID.ROWID_BLOCK_NUMBER('AAASxxAAGAAAACHAAA') "Block number" from DUAL;

Block number
------------
         135

SQL> show parameter db_block_size

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_block_size                        integer     8192

From the above I see that the Oracle block size is 8 KB and that my data is in block 135 of ASM datafile 272.

ASM instance

I now connect to the ASM instance and first check the extent distribution for ASM datafile 272.

SQL> select GROUP_NUMBER from V$ASM_DISKGROUP where NAME='DATA';

GROUP_NUMBER
------------
           1

SQL> select PXN_KFFXP, -- physical extent number
XNUM_KFFXP, -- virtual extent number
DISK_KFFXP, -- disk number
AU_KFFXP    -- allocation unit number
from X$KFFXP
where NUMBER_KFFXP=272 -- ASM file 272
AND GROUP_KFFXP=1 -- group number 1
order by 1;

 PXN_KFFXP XNUM_KFFXP DISK_KFFXP   AU_KFFXP
---------- ---------- ---------- ----------
         0          0          0       1175
         1          0          3       1170
         2          1          3       1175
         3          1          2       1179
         4          2          1       1175
...

As expected, the file extents are spread over all disks, and each extent is mirrored because disk group DATA has normal redundancy.

I also need to know the ASM allocation unit size.

SQL> select VALUE from V$ASM_ATTRIBUTE where NAME='au_size' and GROUP_NUMBER=1;

VALUE
-------
1048576

The allocation unit size is 1MB.

Where is my block

I know my data is in block 135 of ASM file 272. With a block size of 8 KB, each allocation unit can hold 128 blocks (128x8K=1MB). That means block 135 is block 7 (135-128=7) in the second allocation unit. The second allocation unit of the file is AU 1175 on disk 3 (and also AU 1179 on disk 2 - remember this is a normal redundancy disk group, so my data is mirrored).
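
The same arithmetic can be done in SQL as a quick sanity check, assuming the 8 KB block size and 1 MB allocation unit size (128 blocks per AU) established above:

SQL> select trunc(135/128) "AU index in file", mod(135,128) "Block in AU" from DUAL;

AU index in file Block in AU
---------------- -----------
               1           7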

Let's get the names of disks 2 and 3.

SQL> select DISK_NUMBER, NAME from V$ASM_DISK where DISK_NUMBER in (2,3);

DISK_NUMBER NAME
----------- ------------------------------
          2 ASMDISK3
          3 ASMDISK4

SQL>

I am using ASMLIB, so at the OS level, those disks are /dev/oracleasm/disks/ASMDISK3 and /dev/oracleasm/disks/ASMDISK4.

Show me the money

Let's recap. My data (CAT) is 7 blocks (of 8 KB each) into allocation unit 1175. That allocation unit is 1175 MB into disk /dev/oracleasm/disks/ASMDISK4. Let's first extract that allocation unit.

# dd if=/dev/oracleasm/disks/ASMDISK4 bs=1024k count=1 skip=1175 of=/tmp/AU1175.dd
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.057577 seconds, 18.2 MB/s
# ls -l /tmp/AU1175.dd
-rw-r--r-- 1 root root 1048576 Oct 27 22:45 /tmp/AU1175.dd

Note the arguments to the dd command - bs=1024k (allocation unit size), skip=1175 (allocation unit I am interested in) and count=1 (I only need one allocation unit).

Let's now extract block 7 out of that allocation unit.

# dd if=/tmp/AU1175.dd bs=8k count=1 skip=7 of=/tmp/block135.dd

Note the arguments to the dd command now - bs=8k (data block size) and skip=7 (block I am interested in).
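
The two dd steps can also be combined into one by skipping directly to the block offset within the disk; a sketch, assuming the same allocation unit size (1 MB = 128 x 8 KB blocks) as above:

# dd if=/dev/oracleasm/disks/ASMDISK4 bs=8k count=1 skip=$((1175*128 + 7)) of=/tmp/block135.dd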

Let's now look at that block.

# od -c /tmp/block135.dd
...
0017760  \0  \0   , 001 002 002 301 002 003   C   A   T 001 006 332 217
0020000
#

At the bottom of that block I see my data (CAT). Remember that Oracle blocks are populated from the bottom up. I would see the same if I looked at allocation unit 1179 on disk /dev/oracleasm/disks/ASMDISK3.

Conclusion

To locate an Oracle data block in ASM, I had to know in which datafile that block was stored. I then queried X$KFFXP in ASM to see the extent distribution for that datafile. I also had to know both Oracle block size and ASM allocation unit size, to work out in which allocation unit my block was.

None of this is ASM or RDBMS version specific (except the query from V$ASM_ATTRIBUTE, as there is no such view in 10g). The ASM disk group redundancy was also irrelevant. Of course, with normal and high redundancy we will have multiple copies of data, but the method to find the data location is exactly the same for all types of disk group redundancy.

Identification of under-performing disks in Exadata


Starting with Exadata software version 11.2.3.2, an under-performing disk can be detected and removed from an active configuration. This feature applies to both hard disks and flash disks.

About storage cell software processes

The Cell Server (CELLSRV) is the main component of Exadata software, which services I/O requests and provides advanced Exadata services, such as predicate processing offload. CELLSRV is implemented as a multithreaded process and is expected to use the largest portion of processor cycles on a storage cell.

The Management Server (MS) provides storage cell management and configuration tasks.

Disk state changes

Possibly under-performing - confined online

When poor disk performance is detected by the CELLSRV, the cell disk status changes to 'normal - confinedOnline' and the physical disk status changes to 'warning - confinedOnline'. This is expected behavior and it indicates that the disk has entered the first phase of the identification of an under-performing disk. This is a transient phase, i.e. the disk does not stay in this status for a prolonged period of time.

That disk status change would be associated with the following entry in the storage cell alerthistory:

[MESSAGE ID] [date and time] info "Hard disk entered confinement status. The LUN n_m changed status to warning - confinedOnline. CellDisk changed status to normal - confinedOnline. Status: WARNING - CONFINEDONLINE  Manufacturer: [name]  Model Number: [model]  Size: [size]  Serial Number: [S/N]  Firmware: [F/W version]  Slot Number: m  Cell Disk: [cell disk name]  Grid Disk: [grid disk 1], [grid disk 2] ... Reason for confinement: threshold for service time exceeded"

At the same time, the following will be logged in the storage cell alert log:

CDHS: Mark cd health state change [cell disk name]  with newState HEALTH_BAD_ONLINE pending HEALTH_BAD_ONLINE ongoing INVALID cur HEALTH_GOOD
Celldisk entering CONFINE ACTIVE state with cause CD_PERF_SLOW_ABS activeForced: 0 inactiveForced: 0 trigger HistoryFail: 0, forceTestOutcome: 0 testFail: 0
global conf related state: numHDsConf: 1 numFDsConf: 0 numHDsHung: 0 numFDsHung: 0
[date and time]
CDHS: Do cd health state change [cell disk name] from HEALTH_GOOD to newState HEALTH_BAD_ONLINE
CDHS: Done cd health state change  from HEALTH_GOOD to newState HEALTH_BAD_ONLINE
ABSOLUTE SERVICE TIME VIOLATION DETECTED ON DISK [device name]: CD name - [cell disk name] AVERAGE SERVICETIME: 130.913043 ms. AVERAGE WAITTIME: 101.565217 ms. AVERAGE REQUESTSIZE: 625 sectors. NUMBER OF IOs COMPLETED IN LAST CYCLE ON DISK: 23 THRESHOLD VIOLATION COUNT: 6 NON_ZERO SERVICETIME COUNT: 6 SET CONFINE SUCCESS: 1
NOTE: Initiating ASM Instance operation: Query ASM Deactivation Outcome on 3 disks
Published 1 grid disk events Query ASM Deactivation Outcome on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 26502
Published 1 grid disk events Query ASM Deactivation Outcome on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 28966
Published 1 grid disk events Query ASM Deactivation Outcome on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 11912
...

Prepare for test - confined offline

The next action is to take all grid disks on the cell disk offline and run the performance tests on it. The CELLSRV asks ASM to take the grid disks offline and, if possible, ASM takes them offline. In that case, the cell disk status changes to 'normal - confinedOffline' and the physical disk status changes to 'warning - confinedOffline'.

That action would be associated with the following entry in the cell alerthistory:

[MESSAGE ID] [date and time] warning "Hard disk entered confinement offline status. The LUN n_m changed status to warning - confinedOffline. CellDisk changed status to normal - confinedOffline. All subsequent I/Os on this disk are failed immediately. Confinement tests will be run on the disk to determine if the disk should be dropped. Status: WARNING - CONFINEDOFFLINE  Manufacturer: [name]  Model Number: [model]  Size: [size]  Serial Number: [S/N]  Firmware: [F/W version]  Slot Number: m  Cell Disk: [cell disk name]  Grid Disk: [grid disk 1], [grid disk 2] ... Reason for confinement: threshold for service time exceeded"
The following will be logged in the storage cell alert log:
NOTE: Initiating ASM Instance operation: ASM OFFLINE disk on 3 disks
Published 1 grid disk events ASM OFFLINE disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 28966
Published 1 grid disk events ASM OFFLINE disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 31801
Published 1 grid disk events ASM OFFLINE disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 26502
CDHS: Do cd health state change [cell disk name] from HEALTH_BAD_ONLINE to newState HEALTH_BAD_OFFLINE
CDHS: Done cd health state change  from HEALTH_BAD_ONLINE to newState HEALTH_BAD_OFFLINE

Note that ASM will take the grid disks offline only if possible. That means ASM will not offline any disks if that would result in a disk group dismount. For example, if a partner disk is already offline, ASM will not offline this disk. In that case, the cell disk status will stay at 'normal - confinedOnline' until the disk can be safely taken offline.

In that case, the CELLSRV will repeatedly log 'Query ASM Deactivation Outcome' messages in the cell alert log. This is expected behavior and the messages will stop once ASM can take the grid disks offline.
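
To see where things stand from the storage cell itself, you can list the relevant griddisk attributes; a sketch using cellcli with the standard griddisk attributes:

CellCLI> list griddisk attributes name, asmmodestatus, asmdeactivationoutcome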

Under stress test

Once all grid disks are offline, the MS runs the performance tests on the cell disk. If it turns out that the disk is performing well, MS will notify CELLSRV that the disk is fine. The CELLSRV will then notify ASM to put the grid disks back online.

Poor performance - drop force

If the MS finds that the disk is indeed performing poorly, the cell disk status will change to 'proactive failure' and the physical disk status will change to 'warning - poor performance'. Such a disk will need to be removed from the active configuration. In that case the MS notifies the CELLSRV, which in turn notifies ASM to drop all grid disks from that cell disk.

That action would be associated with the following entry in the cell alerthistory:

[MESSAGE ID] [date and time] critical "Hard disk entered poor performance status. Status: WARNING - POOR PERFORMANCE Manufacturer: [name] Model Number: [model]  Size: [size]  Serial Number: [S/N] Firmware: [F/W version] Slot Number: m Cell Disk: [cell disk name]  Grid Disk: [grid disk 1], [grid disk 2] ... Reason for poor performance : threshold for service time exceeded"
The following will be logged in the storage cell alert log:
CDHS: Do cd health state change  after confinement [cell disk name] testFailed 1
CDHS: Do cd health state change [cell disk name] from HEALTH_BAD_OFFLINE to newState HEALTH_FAIL
NOTE: Initiating ASM Instance operation: ASM DROP dead disk on 3 disks
Published 1 grid disk events ASM DROP dead disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 28966
Published 1 grid disk events ASM DROP dead disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 11912
Published 1 grid disk events ASM DROP dead disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 26502
CDHS: Done cd health state change  from HEALTH_BAD_OFFLINE to newState HEALTH_FAIL

In the ASM alert log we will see the drop disk force operations for the respective grid disks, followed by the disk group rebalance operation.

Once the rebalance completes, the problem disk should be replaced by following the same process as for a disk with the status 'predictive failure'.

All well - back to normal

If the MS tests determine that there are no performance issues with the disk, it will pass that information onto CELLSRV, which will in turn ask ASM to put the grid disks back online. The cell and physical disk status will change back to normal.

Disk confinement triggers

Any of the following conditions can trigger a disk confinement:
  1. Hung cell disk (the cause code in the storage cell alert log will be CD_PERF_HANG).
  2. Slow cell disk, e.g. high service time threshold (CD_PERF_SLOW_ABS), high relative service time threshold (CD_PERF_SLOW_RLTV), etc.
  3. High read or write latency, e.g. high latency on writes (CD_PERF_SLOW_LAT_WT), high latency on reads (CD_PERF_SLOW_LAT_RD), high latency on both reads and writes (CD_PERF_SLOW_LAT_RW), very high absolute latency on individual I/Os happening frequently (CD_PERF_SLOW_LAT_ERR), etc.
  4. Errors, e.g. I/O errors (CD_PERF_IOERR).
Conclusion

As a single under-performing disk can impact overall system performance, a new feature has been introduced in Exadata to identify and remove such disks from the active configuration. This is a fully automated process that includes an automatic service request (ASR) for disk replacement.

I have recently published this on MOS as Doc ID 1509105.1.

Auto disk management feature in Exadata


The automatic disk management feature is about automating ASM disk operations in an Exadata environment. The automation functionality applies to both planned actions (for example, deactivating griddisks in preparation for storage cell patching) and unplanned events (for example, disk failure).

Exadata disks

In an Exadata environment we have the following disk types:
  • Physicaldisk is a hard disk on a storage cell. Each storage cell has 12 physical disks, all with the same capacity (600 GB, 2 TB or 3 TB).
  • Flashdisk is a Sun Flash Accelerator PCIe solid state disk on a storage cell. Each storage cell has 16 flashdisks - 24 GB each in X2 (Sun Fire X4270 M2) and 100 GB each in X3 (Sun Fire X4270 M3) servers.
  • Celldisk is a logical disk created on every physicaldisk and every flashdisk on a storage cell. Celldisks created on physicaldisks are named CD_00_cellname, CD_01_cellname ... CD_11_cellname. Celldisks created on flashdisks are named FD_00_cellname, FD_01_cellname ... FD_15_cellname.
  • Griddisk is a logical disk that can be created on a celldisk. In a standard Exadata deployment we create griddisks on hard disk based celldisks only. While it is possible to create griddisks on flashdisks, this is not a standard practice; instead we use flash based celldisks for the flash cache and flash log.
  • ASM disk in an Exadata environment is a griddisk.
Automated disk operations

These are the disk operations that are automated in Exadata:

1. Griddisk status change to OFFLINE/ONLINE

If a griddisk becomes temporarily unavailable, it will be automatically OFFLINED by ASM. When the griddisk becomes available, it will be automatically ONLINED by ASM.

2. Griddisk DROP/ADD

If a physicaldisk fails, all griddisks on that physicaldisk will be DROPPED with FORCE option by ASM. If a physicaldisk status changes to predictive failure, all griddisks on that physical disk will be DROPPED by ASM. If a flashdisk performance degrades, the corresponding griddisks (if any) will be DROPPED with FORCE option by ASM.

When a physicaldisk is replaced, the celldisk and griddisks will be recreated by CELLSRV, and the griddisks will be automatically ADDED by ASM.

NOTE: If a griddisk in NORMAL state and ONLINE mode status is manually dropped with the FORCE option (for example, by a DBA with 'alter diskgroup ... drop disk ... force'), it will be automatically added back by ASM. In other words, dropping a healthy disk with the force option will not achieve the desired effect.

3. Griddisk OFFLINE/ONLINE for rolling Exadata software (storage cells) upgrade

Before the rolling upgrade all griddisks will be inactivated on the storage cell by CELLSRV and OFFLINED by ASM. After the upgrade all griddisks will be activated on the storage cell and ONLINED in ASM.

4. Manual griddisk activation/inactivation

If a griddisk is manually inactivated on a storage cell, by running 'cellcli -e alter griddisk ... inactive', it will be automatically OFFLINED by ASM. When a griddisk is activated on a storage cell, it will be automatically ONLINED by ASM.
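
For example, inactivating and then reactivating a single griddisk from the storage cell (the griddisk name here is just a placeholder):

# cellcli -e alter griddisk DATA_CD_00_cell01 inactive
# cellcli -e alter griddisk DATA_CD_00_cell01 active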

5. Griddisk confined ONLINE/OFFLINE

If a griddisk is taken offline by CELLSRV, because the underlying disk is suspected for poor performance, all griddisks on that celldisk will be automatically OFFLINED by ASM. If the tests confirm that the celldisk is performing poorly, ASM will drop all griddisks on that celldisk. If the tests find that the disk is actually fine, ASM will online all griddisks on that celldisk.

Software components

1. Cell Server (CELLSRV)

The Cell Server (CELLSRV) runs on the storage cell and is the main component of Exadata software. In the context of automatic disk management, its tasks are to process the Management Server notifications and handle ASM queries about the state of griddisks.

2. Management Server (MS)

The Management Server (MS) runs on the storage cell and implements a web service for cell management commands, and runs background monitoring threads. The MS monitors the storage cell for hardware changes (e.g. disk plugged in) or alerts (e.g. disk failure), and notifies the CELLSRV about those events.

3. Automatic Storage Management (ASM)

The Automatic Storage Management (ASM) instance runs on the compute (database) node and has two processes that are relevant to the automatic disk management feature:
  • Exadata Automation Manager (XDMG) initiates automation tasks involved in managing Exadata storage. It monitors all configured storage cells for state changes, such as a failed disk getting replaced, and performs the required tasks for such events. Its primary tasks are to watch for inaccessible disks and cells and when they become accessible again, to initiate the ASM ONLINE operation.
  • Exadata Automation Manager (XDWK) performs automation tasks requested by XDMG. It gets started when asynchronous actions such as disk ONLINE, DROP and ADD are requested by XDMG. After a 5 minute period of inactivity, this process will shut itself down.
Working together

All three software components work together to achieve automatic disk management.

In the case of disk failure, the MS detects that the disk has failed. It then notifies the CELLSRV about it. If there are griddisks on the failed disk, the CELLSRV notifies ASM about the event. ASM then drops all griddisks from the corresponding disk groups.

In the case of a replacement disk inserted into the storage cell, the MS detects the new disk and checks the cell configuration file to see if celldisk and griddisks need to be created on it. If yes, it notifies the CELLSRV to do so. Once finished, the CELLSRV notifies ASM about new griddisks and ASM then adds them to the corresponding disk groups.

In the case of a poorly performing disk, the CELLSRV first notifies ASM to offline the disk. If possible, ASM then offlines the disk. One example of when ASM would refuse to offline the disk is when a partner disk is already offline - offlining the disk would then result in a disk group dismount, so ASM will not do that. Once the disk is offlined by ASM, it notifies the CELLSRV that the performance tests can be carried out. Once done with the tests, the CELLSRV will either tell ASM to drop that disk (if it failed the tests) or online it (if it passed the tests).

The actions by MS, CELLSRV and ASM are coordinated in a similar fashion, for other disk events.

ASM initialization parameters

The following are the ASM initialization parameters relevant to the auto disk management feature:
  • _AUTO_MANAGE_EXADATA_DISKS controls the auto disk management feature. To disable the feature set this parameter to FALSE. Range of values: TRUE [default] or FALSE.
  • _AUTO_MANAGE_NUM_TRIES controls the maximum number of attempts to perform an automatic operation. Range of values: 1-10. Default value is 2.
  • _AUTO_MANAGE_MAX_ONLINE_TRIES controls maximum number of attempts to ONLINE a disk. Range of values: 1-10. Default value is 3.
All three parameters are static, which means they require an ASM instance restart to take effect. Note that all of these are hidden (underscore) parameters that should not be modified unless advised by Oracle Support.
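
The current values of these hidden parameters can be checked in the ASM instance with the usual X$ query; a sketch for inspection only:

SQL> select p.KSPPINM "Parameter", v.KSPPSTVL "Value", p.KSPPDESC "Description"
from X$KSPPI p, X$KSPPCV v
where p.INDX = v.INDX
and p.KSPPINM like '\_auto\_manage%' escape '\';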

Files

The following are the files relevant to the automatic disk management feature:

1. Cell configuration file - $OSSCONF/cell_disk_config.xml. An XML file on the storage cell that contains information about all configured objects (storage cell, disks, IORM plans, etc) except alerts and metrics. The CELLSRV reads this file during startup and writes to it when an object is updated (e.g. updates to IORM plan).

2. Grid disk file - $OSSCONF/griddisk.owners.dat. A binary file on the storage cell that contains the following information for all griddisks:
  • ASM disk name
  • ASM disk group name
  • ASM failgroup name
  • Cluster identifier (which cluster this disk belongs to)
  • Requires DROP/ADD (should the disk be dropped from or added to ASM)
3. MS log and trace files - ms-odl.log and ms-odl.trc in $ADR_BASE/diag/asm/cell/`hostname -s`/trace directory on the storage cell.

4. CELLSRV alert log - alert.log in $ADR_BASE/diag/asm/cell/`hostname -s`/trace directory on the storage cell.

5. ASM alert log - alert_+ASMn.log in $ORACLE_BASE/diag/asm/+asm/+ASMn/trace directory on the compute node.

6. XDMG and XDWK trace files - +ASMn_xdmg_nnnnn.trc and +ASMn_xdwk_nnnnn.trc in $ORACLE_BASE/diag/asm/+asm/+ASMn/trace directory on the compute node.

Conclusion

In an Exadata environment, ASM has been enhanced to provide automatic disk management functionality. The three software components that work together to provide this facility are the Exadata Cell Server (CELLSRV), the Exadata Management Server (MS) and Automatic Storage Management (ASM).

I have also published this via MOS as Doc ID 1484274.1.

How many allocation units per file

This post is about the amount of space allocated to ASM based files.

The smallest amount of space ASM allocates is an allocation unit (AU). The default AU size is 1 MB, except in Exadata where the default AU size is 4 MB.

The space for ASM based files is allocated in extents, which consist of one or more AUs. In version 11.2, the first 20000 extents consist of 1 AU each, the next 20000 extents have 4 AUs each, and extents beyond that have 16 AUs each. This is known as the variable size extent feature. In version 11.1, the extent growth was 1-8-64 AUs. In version 10, there are no variable size extents, so all extents are exactly 1 AU.
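
As an illustration of what the 11.2 extent growth means, here is a quick calculation for a hypothetical 30 GB file in a disk group with 1 MB allocation units - the first 20000 extents cover 20000 MB, and the rest is allocated in 4 AU extents:

SQL> select 20000 + ceil((30*1024 - 20000)/4) "Extents (30 GB file, 1 MB AU)" from DUAL;

Extents (30 GB file, 1 MB AU)
-----------------------------
                        22680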

Bytes vs space

The definition for V$ASM_FILE view says the following for BYTES and SPACE columns:
  • BYTES - Number of bytes in the file
  • SPACE - Number of bytes allocated to the file
There is a subtle difference in the definitions and a very large difference in the numbers. Let's have a closer look. For the examples in this post I will use database and ASM version 11.2.0.3, with ASMLIB based disks.

First get some basic info about disk group DATA where most of my datafiles are. Run the following SQL connected to the database instance.

SQL> select NAME, GROUP_NUMBER, ALLOCATION_UNIT_SIZE/1024/1024 "AU size (MB)", TYPE
from V$ASM_DISKGROUP
where NAME='DATA';

NAME             GROUP_NUMBER AU size (MB) TYPE
---------------- ------------ ------------ ------
DATA                        1            1 NORMAL

Now create one small file (under 60 extents) and one large file (over 60 extents).

SQL> create tablespace T1 datafile '+DATA' size 10 M;

Tablespace created.

SQL> create tablespace T2 datafile '+DATA' size 100 M;

Tablespace created.

Get the ASM file numbers for those two files:

SQL> select NAME, round(BYTES/1024/1024) "MB" from V$DATAFILE;

NAME                                               MB
------------------------------------------ ----------
...
+DATA/br/datafile/t1.272.818281717                 10
+DATA/br/datafile/t2.271.818281741                100

The small file is ASM file number 272 and the large file is ASM file number 271.

Get the bytes and space information (in AUs) for these two files.

SQL> select FILE_NUMBER, round(BYTES/1024/1024) "Bytes (AU)", round(SPACE/1024/1024) "Space (AUs)", REDUNDANCY
from V$ASM_FILE
where FILE_NUMBER in (271, 272) and GROUP_NUMBER=1;

FILE_NUMBER Bytes (AU) Space (AUs) REDUND
----------- ---------- ----------- ------
        272         10          22 MIRROR
        271        100         205 MIRROR

BYTES shows the actual file size. For the small file, BYTES shows the file size is 10 AUs = 10 MB (the AU size is 1 MB). The space required for the small file is 22 AUs: 10 AUs for the actual datafile plus 1 AU for the file header, and because the file is mirrored, double that - 22 AUs in total.

For the large file, BYTES shows the file size is 100 AUs = 100 MB. So far so good. But the space required for the large file is 205 AUs, not 202 as one might expect. What are those extra 3 AUs for? Let's find out.

ASM space

The following query (run in ASM instance) will show us the extent distribution for ASM file 271.

SQL> select XNUM_KFFXP "Virtual extent", PXN_KFFXP "Physical extent", DISK_KFFXP "Disk number", AU_KFFXP "AU number"
from X$KFFXP
where GROUP_KFFXP=1 and NUMBER_KFFXP=271
order by 1,2;

Virtual extent Physical extent Disk number  AU number
-------------- --------------- ----------- ----------
             0               0           3       1155
             0               1           0       1124
             1               2           0       1125
             1               3           2       1131
             2               4           2       1132
             2               5           0       1126
...
           100             200           3       1418
           100             201           1       1412
    2147483648               0           3       1122
    2147483648               1           0       1137
    2147483648               2           2       1137

205 rows selected.

As the file is mirrored, we see that each virtual extent has two physical extents. But the interesting part of the result is the last three allocation units, for virtual extent number 2147483648, which is triple mirrored. We will have a closer look at those with kfed, and for that we will need the disk names.

Get the disk names.

SQL> select DISK_NUMBER, PATH
from V$ASM_DISK
where GROUP_NUMBER=1;

DISK_NUMBER PATH
----------- ---------------
          0 ORCL:ASMDISK1
          1 ORCL:ASMDISK2
          2 ORCL:ASMDISK3
          3 ORCL:ASMDISK4

Let's now check what type of data is in those allocation units.

$ kfed read /dev/oracleasm/disks/ASMDISK4 aun=1122 | grep type
kfbh.type:                           12 ; 0x002: KFBTYP_INDIRECT

$ kfed read /dev/oracleasm/disks/ASMDISK1 aun=1137 | grep type
kfbh.type:                           12 ; 0x002: KFBTYP_INDIRECT

$ kfed read /dev/oracleasm/disks/ASMDISK3 aun=1137 | grep type
kfbh.type:                           12 ; 0x002: KFBTYP_INDIRECT

These additional allocation units hold ASM metadata for the large file. More specifically, they hold extent map information that could not fit into the ASM file directory block. The file directory needs extra space to keep track of files larger than 60 extents, so it needs an additional allocation unit to do so. While the file directory needs only a few extra ASM metadata blocks, the smallest unit of space ASM can allocate is an AU. And because this is metadata, this AU is triple mirrored (even in a normal redundancy disk group), hence 3 extra allocation units for the large file. In an external redundancy disk group, there would be only one extra AU per large file.

Conclusion

The amount of space ASM needs for a file depends on two factors - the file size and the disk group redundancy.

In an external redundancy disk group, the required space will be the file size + 1 AU for the file header + 1 AU for indirect extents if the file is larger than 60 AUs.

In a normal redundancy disk group, the required space will be twice the file size + 2 AUs for the file header + 3 AUs for indirect extents if the file is larger than 60 AUs.

In a high redundancy disk group, the required space will be three times the file size + 3 AUs for the file header + 3 AUs for indirect extents if the file is larger than 60 AUs.
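
As a sanity check against the figures above: for the 100 MB file in the normal redundancy disk group with 1 MB allocation units, that formula gives 2 x 100 + 2 + 3 = 205 AUs, which matches the SPACE value reported by V$ASM_FILE.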

ASM version 12c is out


Oracle Database version 12c has been released, which means a brand new version of ASM is out! Notable new features are Flex ASM, proactive data validation and better handling of disk management operations. Let's have an overview here, with more details to follow in separate posts.

Flex ASM

No need to run ASM instances on all nodes in the cluster. In a default installation there would be three ASM instances, irrespective of the number of nodes in the cluster. An ASM instance can serve both local and remote databases. If an ASM instance fails, the database instances do not crash; instead they fail over to another ASM instance in the cluster.

Flex ASM introduces a new instance type - an I/O server or ASM proxy instance. There will be a few (the default is 3) I/O server instances in an Oracle flex cluster environment, serving indirect clients (typically an ACFS cluster file system). An I/O server instance can run on the same node as an ASM instance or on a different node in a flex cluster. In all cases, an I/O server instance needs to talk to a flex ASM instance to get metadata information on behalf of an indirect client.

Flex ASM is an optional feature in 12c.

Physical metadata replication

In addition to replicating the disk header (available since 11.1.0.7), ASM 12c also replicates the allocation table, within each disk. This makes ASM more resilient to bad disk sectors and external corruptions. The disk group attribute PHYS_META_REPLICATED is provided to track the replication status of a disk group.

$ asmcmd lsattr -G DATA -l phys_meta_replicated
Name                  Value
phys_meta_replicated  true

The physical metadata replication status flag is in the disk header (kfdhdb.flags). This flag only ever goes from 0 to 1 (once the physical metadata has been replicated) and it never goes back to 0.

More storage

ASM 12c supports 511 disk groups, with the maximum disk size of 32 PB.

Online with power

ASM 12c has a fast mirror resync power limit to control resync parallelism and improve performance. Disk resync checkpoint functionality provides faster recovery from instance failures by enabling the resync to resume from the point at which the process was interrupted or stopped, instead of starting from the beginning. ASM 12c also provides a time estimate for the completion of a resync operation.

Use the power limit for disk resync operations, similar to the disk rebalance power, with a range from 1 to 1024:

$ asmcmd online -G DATA -D DATA_DISK1 --power 42

Disk scrubbing - proactive data validation and repair

In ASM 12c, disk scrubbing checks for logical data corruptions and repairs them automatically in normal and high redundancy disk groups. This is done during the disk group rebalance if the disk group attribute CONTENT.CHECK is set to TRUE. The check can also be performed manually by running the ALTER DISKGROUP SCRUB command.

The scrubbing can be performed at the disk group, disk or a file level and can be monitored via V$ASM_OPERATION view.
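
For example, a manual check of disk group DATA might look like this (a sketch, with the optional POWER clause):

SQL> alter diskgroup DATA scrub power low;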

Even read for disk groups

In previous ASM versions, the data was always read from the primary copy (in normal or high redundancy disk groups) unless a preferred failgroup was set up. The data from the mirror would be read only if the primary copy of the data was unavailable. With the even read feature, each read request can be sent to the least loaded of the possible source disks. The least loaded in this context is simply the disk with the fewest read requests.

Even read functionality is enabled by default on all Oracle Database and Oracle ASM instances of version 12.1 and higher in non-Exadata environments. The functionality is enabled in an Exadata environment when there is a failure. Even read functionality is applicable only to disk groups with normal or high redundancy.

Replace an offline disk

We now have a new ALTER DISKGROUP REPLACE DISK command, which is a mix of the rebalance and fast mirror resync functionality. Instead of a full rebalance, the new replacement disk is populated with data read from the surviving partner disks only. This effectively reduces the time to replace a failed disk.

Note that the disk being replaced must be in OFFLINE state. If the disk offline timer has expired, the disk is dropped, which initiates the rebalance. On a disk add, there will be another rebalance.

ASM password file in a disk group

ASM version 11.2 allowed the ASM spfile to be placed in a disk group. In 12c we can also put the ASM password file in an ASM disk group. Unlike the ASM spfile, access to the ASM password file is possible only after ASM startup and once the disk group containing the password file is mounted.

The orapwd utility now accepts an ASM disk group as a password file destination. asmcmd has also been enhanced to allow ASM password file management.
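
A minimal sketch of creating an ASM password file directly in a disk group, assuming disk group DATA and 12c orapwd/asmcmd syntax:

$ orapwd file='+DATA/orapwasm' asm=y

$ asmcmd pwget --asm
+DATA/orapwasm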

Failgroup repair timer

We now have a failgroup repair timer with the default value of 24 hours. Note that the disk repair timer still defaults to 3.6 hours.

Rebalance rebalanced

The rebalance work is now estimated based on a detailed work plan, which can be generated and viewed separately. We now have a new EXPLAIN WORK command and a new V$ASM_ESTIMATE view.
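
A sketch of how the two fit together, assuming a disk group DATA with a disk named DATA_0001:

SQL> explain work for alter diskgroup DATA drop disk DATA_0001;

Explained.

SQL> select EST_WORK from V$ASM_ESTIMATE where GROUP_NUMBER=1;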

In ASM 12c we (finally) have a priority ordered rebalance - the critical files (typically control files and redo logs) are rebalanced before other database files.

In Exadata, the rebalance can be offloaded to storage cells.

Thin provisioning support

ASM 12c enables thin provisioning support for some operations (typically those associated with disk group rebalance). The feature is disabled by default, and can be enabled at disk group creation time or later by setting the disk group attribute THIN_PROVISIONED to TRUE.
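
For example, to enable it on an existing disk group, using the attribute named above:

SQL> alter diskgroup DATA set attribute 'thin_provisioned'='TRUE';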

Enhanced file access control (ACL)

Easier file ownership and permission changes; for example, file permissions can now be changed on an open file. ACL support has also been implemented for the Microsoft Windows OS.

Oracle Cluster Registry (OCR) backup in ASM disk group

Storing the OCR backup in an Oracle ASM disk group simplifies OCR management by permitting access to the OCR backup from any node in the cluster should an OCR recovery become necessary.

Use ocrconfig command to specify an OCR backup location in an Oracle ASM disk group:

# ocrconfig -backuploc +DATA

Physical metadata replication


Starting with version 12.1, ASM replicates the physically addressed metadata. This means that ASM maintains two copies of the disk header, the Free Space Table and the Allocation Table data. Note that this metadata is not mirrored, but replicated. ASM mirroring refers to copies of the same data on different disks. The copies of the physical metadata are on the same disk, hence the term replicated. This also means that the physical metadata is replicated even in an external redundancy disk group.

The Partnership and Status Table (PST) is also referred to as physically addressed metadata, but the PST is not replicated. This is because the PST is protected by mirroring - in normal and high redundancy disk groups.

Where is the replicated metadata

The physically addressed metadata is in allocation unit 0 (AU0) on every ASM disk. With this feature enabled, ASM will copy the contents of AU0 into allocation unit 11 (AU11), and from that point on, it will maintain both copies. This feature will be automatically enabled when a disk group is created with ASM compatibility of 12.1 or higher, or when ASM compatibility is advanced to 12.1 or higher, for an existing disk group.

If there is data in AU11 when the ASM compatibility is advanced to 12.1 or higher, ASM will simply move that data somewhere else and use AU11 for the physical metadata replication.

Since version 11.1.0.7, ASM keeps a copy of the disk header in the second last block of AU1. Interestingly, in version 12.1, ASM still keeps the copy of the disk header in AU1, which means that now every ASM disk will have three copies of the disk header block.

Disk group attribute PHYS_META_REPLICATED

The status of the physical metadata replication can be checked by querying the disk group attribute PHYS_META_REPLICATED. Here is an example with the asmcmd command that shows how to check the replication status for disk group DATA:

$ asmcmd lsattr -G DATA -l phys_meta_replicated
Name                  Value
phys_meta_replicated  true

The phys_meta_replicated=true means that the physical metadata for disk group DATA has been replicated.

The kfdhdb.flags field in the ASM disk header indicates the status of the physical metadata replication as follows:
  • kfdhdb.flags = 0 - the physical metadata has not been replicated
  • kfdhdb.flags = 1 - the physical metadata has been replicated
  • kfdhdb.flags = 2 - physical metadata replication is in progress
Once the flag is set to 1, it will never go back to 0.

Metadata replication in action

As stated earlier, the physical metadata will be replicated in disk groups with ASM compatibility of 12.1 or higher. Let's first have a look at a disk group with ASM compatibility set to 12.1:

$ asmcmd lsattr -G DATA -l compatible.asm
Name            Value
compatible.asm  12.1.0.0.0
$ asmcmd lsattr -G DATA -l phys_meta_replicated
Name                  Value
phys_meta_replicated  true

This shows that the physical metadata has been replicated. Now verify that all disks in the disk group have the kfdhdb.flags set to 1:

$ for disk in `asmcmd lsdsk -G DATA --suppressheader`; do kfed read $disk | egrep "dskname|flags"; done
kfdhdb.dskname:               DATA_0000 ; 0x028: length=9
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001
kfdhdb.dskname:               DATA_0001 ; 0x028: length=9
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001
kfdhdb.dskname:               DATA_0002 ; 0x028: length=9
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001
kfdhdb.dskname:               DATA_0003 ; 0x028: length=9
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001

This shows that all disks have the replication flag set to 1, i.e. that the physical metadata has been replicated for all disks in the disk group.

Let's now have a look at a disk group with ASM compatibility 11.2, that is later advanced to 12.1:

SQL> create diskgroup DG1 external redundancy
  2  disk '/dev/sdi1'
  3  attribute 'COMPATIBLE.ASM'='11.2';

Diskgroup created.

Check the replication status:

$ asmcmd lsattr -G DG1 -l phys_meta_replicated
Name  Value

Nothing - no such attribute. That is because the ASM compatibility is less than 12.1. We also expect that the kfdhdb.flags is 0 for the only disk in that disk group:

$ kfed read /dev/sdi1 | egrep "type|dskname|grpname|flags"
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfdhdb.dskname:                DG1_0000 ; 0x028: length=8
kfdhdb.grpname:                     DG1 ; 0x048: length=3
kfdhdb.flags:                         0 ; 0x0fc: 0x00000000

Let's now advance the ASM compatibility to 12.1:

$ asmcmd setattr -G DG1 compatible.asm 12.1.0.0.0

Check the replication status:

$ asmcmd lsattr -G DG1 -l phys_meta_replicated
Name                  Value
phys_meta_replicated  true

The physical metadata has been replicated, so we should now see the kfdhdb.flags set to 1:

$ kfed read /dev/sdi1 | egrep "dskname|flags"
kfdhdb.dskname:                DG1_0000 ; 0x028: length=8
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001

The physical metadata should be replicated in AU11:

$ kfed read /dev/sdi1 aun=11 | egrep "type|dskname|flags"
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfdhdb.dskname:                DG1_0000 ; 0x028: length=8
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001

$ kfed read /dev/sdi1 aun=11 blkn=1 | grep type
kfbh.type:                            2 ; 0x002: KFBTYP_FREESPC
$ kfed read /dev/sdi1 aun=11 blkn=2 | grep type
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL

This shows that AU11 has a copy of the data from AU0.

Finally check for the disk header copy in AU1:

$ kfed read /dev/sdi1 aun=1 blkn=254 | grep type
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD

This shows that there is also a copy of the disk header in the second last block of AU1.

Conclusion

ASM version 12 replicates the physically addressed metadata, i.e. it keeps the copy of AU0 in AU11 - on the same disk. This allows ASM to automatically recover from damage to any data in AU0. Note that ASM will not be able to recover from loss of any other data in an external redundancy disk group. In a normal redundancy disk group, ASM will be able to recover from a loss of any data in one or more disks in a single failgroup. In a high redundancy disk group, ASM will be able to recover from a loss of any data in one or more disks in any two failgroups.


Free Space Table


The ASM Free Space Table (FST) provides a summary of which allocation table blocks have free space. It contains an array of bit patterns indexed by allocation table block number. The table is used to speed up the allocation of new allocation units by avoiding reading blocks that are full.

The FST is technically part of the Allocation Table (AT), and is at block 1 of the AT. The Free Space Table, and the Allocation Table are so called physically addressed metadata, as they are always at the fixed location on each ASM disk.

Locating the Free Space Table

The location of the FST block is stored in the ASM disk header (field kfdhdb.fstlocn). In the following example, the lookup of that field in the disk header shows that the FST is in block 1.

$ kfed read /dev/sdc1 | grep kfdhdb.fstlocn
kfdhdb.fstlocn:                       1 ; 0x0cc: 0x00000001

Let’s have a closer look at the FST:

$ kfed read /dev/sdc1 blkn=1 | more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            2 ; 0x002: KFBTYP_FREESPC
...
kfdfsb.aunum:                         0 ; 0x000: 0x00000000
kfdfsb.max:                         254 ; 0x004: 0x00fe
kfdfsb.cnt:                         254 ; 0x006: 0x00fe
kfdfsb.bound:                         0 ; 0x008: 0x0000
kfdfsb.flag:                          1 ; 0x00a: B=1
kfdfsb.ub1spare:                      0 ; 0x00b: 0x00
kfdfsb.spare[0]:                      0 ; 0x00c: 0x00000000
kfdfsb.spare[1]:                      0 ; 0x010: 0x00000000
kfdfsb.spare[2]:                      0 ; 0x014: 0x00000000
kfdfse[0].fse:                      119 ; 0x018: FREE=0x7 FRAG=0x7
kfdfse[1].fse:                       16 ; 0x019: FREE=0x0 FRAG=0x1
kfdfse[2].fse:                       16 ; 0x01a: FREE=0x0 FRAG=0x1
kfdfse[3].fse:                       16 ; 0x01b: FREE=0x0 FRAG=0x1
...
kfdfse[4037].fse:                     0 ; 0xfdd: FREE=0x0 FRAG=0x0
kfdfse[4038].fse:                     0 ; 0xfde: FREE=0x0 FRAG=0x0
kfdfse[4039].fse:                     0 ; 0xfdf: FREE=0x0 FRAG=0x0

For this FST block, the first allocation table block is in AU 0:

kfdfsb.aunum:                         0 ; 0x000: 0x00000000

Maximum number of the FST entries this block can hold is 254:

kfdfsb.max:                         254 ; 0x004: 0x00fe

How many Free Space Tables

Large ASM disks may have more than one stride. The field kfdhdb.mfact in the ASM disk header shows the stride size, expressed in allocation units. Each stride will have its own physically addressed metadata, which means that it will have its own Free Space Table.

The second stride will have its physically addressed metadata in the first AU of the stride. Let's have a look.

$ kfed read /dev/sdc1 | grep mfact
kfdhdb.mfact:                    113792 ; 0x0c0: 0x0001bc80

This shows the stride size is 113792 AUs. Let's check the FST for the second stride. That should be in block 1 in AU113792.

$ kfed read /dev/sdc1 aun=113792 blkn=1 | grep type
kfbh.type:                            2 ; 0x002: KFBTYP_FREESPC

As expected, we have another FST in AU113792. If we had another stride, there would be another FST at the beginning of that stride. As it happens, I have a large disk with a few strides, so we see the FST at the beginning of the third stride as well:

$ kfed read /dev/sdc1 aun=227584 blkn=1 | grep type
kfbh.type:                            2 ; 0x002: KFBTYP_FREESPC

Conclusion

The Free Space Table is in block 1 of allocation unit 0 of every ASM disk. If the disk has more than one stride, each stride will have its own Free Space Table.



Allocation Table

Every ASM disk contains at least one Allocation Table (AT) that describes the contents of the disk. The AT has one entry for every allocation unit (AU) on the disk. If an AU is allocated, the Allocation Table will have the extent number and the file number the AU belongs to.

Finding the Allocation Table

The location of the first block of the Allocation Table is stored in the ASM disk header (field kfdhdb.altlocn). In the following example, the lookup of that field shows that the AT starts at block 2.

$ kfed read /dev/sdc1 | grep kfdhdb.altlocn
kfdhdb.altlocn:                       2 ; 0x0d0: 0x00000002

Let’s have a closer look at the first block of the Allocation Table.

$ kfed read /dev/sdc1 blkn=2 | more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
...
kfdatb.aunum:                         0 ; 0x000: 0x00000000
kfdatb.shrink:                      448 ; 0x004: 0x01c0
...

kfdatb.aunum=0 means that AU0 is the first AU described by this AT block. kfdatb.shrink=448 means that this AT block can hold the information for 448 AUs. In the next AT block we should see kfdatb.aunum=448, meaning that it describes the next 448 AUs, starting with AU448. Let's have a look:

$ kfed read /dev/sdc1 blkn=3 | grep kfdatb.aunum
kfdatb.aunum:                       448 ; 0x000: 0x000001c0

The next AT block should show kfdatb.aunum=896:

$ kfed read /dev/sdc1 blkn=4 | grep kfdatb.aunum
kfdatb.aunum:                       896 ; 0x000: 0x00000380

And so on...
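
Instead of reading the blocks one by one, a small loop can print kfdatb.aunum for the first few Allocation Table blocks. A minimal sketch, using the same disk as above - the values should go up in steps of 448 (0, 448, 896, 1344, ...):

$ for blk in 2 3 4 5 6
> do
> kfed read /dev/sdc1 blkn=$blk | grep kfdatb.aunum
> done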

Allocation table entries

For allocated AUs, the Allocation Table entry (kfdate[i]) holds the extent number, the file number and the state of the allocation unit - allocated (flag V=1) vs free or unallocated (flag V=0).

Let’s have a look at Allocation Table block 3.

$ kfed read /dev/sdc1 blkn=3 | more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
...
kfdatb.aunum:                       448 ; 0x000: 0x000001c0
...
kfdate[142].discriminator:            1 ; 0x498: 0x00000001
kfdate[142].allo.lo:                  0 ; 0x498: XNUM=0x0
kfdate[142].allo.hi:            8388867 ; 0x49c: V=1 I=0 H=0 FNUM=0x103
kfdate[143].discriminator:            1 ; 0x4a0: 0x00000001
kfdate[143].allo.lo:                  1 ; 0x4a0: XNUM=0x1
kfdate[143].allo.hi:            8388867 ; 0x4a4: V=1 I=0 H=0 FNUM=0x103
kfdate[144].discriminator:            1 ; 0x4a8: 0x00000001
kfdate[144].allo.lo:                  2 ; 0x4a8: XNUM=0x2
kfdate[144].allo.hi:            8388867 ; 0x4ac: V=1 I=0 H=0 FNUM=0x103
kfdate[145].discriminator:            1 ; 0x4b0: 0x00000001
kfdate[145].allo.lo:                  3 ; 0x4b0: XNUM=0x3
kfdate[145].allo.hi:            8388867 ; 0x4b4: V=1 I=0 H=0 FNUM=0x103
kfdate[146].discriminator:            1 ; 0x4b8: 0x00000001
kfdate[146].allo.lo:                  4 ; 0x4b8: XNUM=0x4
kfdate[146].allo.hi:            8388867 ; 0x4bc: V=1 I=0 H=0 FNUM=0x103
kfdate[147].discriminator:            1 ; 0x4c0: 0x00000001
kfdate[147].allo.lo:                  5 ; 0x4c0: XNUM=0x5
kfdate[147].allo.hi:            8388867 ; 0x4c4: V=1 I=0 H=0 FNUM=0x103
kfdate[148].discriminator:            0 ; 0x4c8: 0x00000000
kfdate[148].free.lo.next:            16 ; 0x4c8: 0x0010
kfdate[148].free.lo.prev:            16 ; 0x4ca: 0x0010
kfdate[148].free.hi:                  2 ; 0x4cc: V=0 ASZM=0x2
kfdate[149].discriminator:            0 ; 0x4d0: 0x00000000
kfdate[149].free.lo.next:             0 ; 0x4d0: 0x0000
kfdate[149].free.lo.prev:             0 ; 0x4d2: 0x0000
kfdate[149].free.hi:                  0 ; 0x4d4: V=0 ASZM=0x0
...

The excerpt shows the Allocation Table entries for file 259 (hexadecimal FNUM=0x103), which start at kfdate[142] and end at kfdate[147]. That shows that ASM file 259 has a total of 6 AUs. The AU numbers will be the index of kfdate[i] plus the offset (kfdatb.aunum=448). In other words, 142+448=590, 143+448=591 ... 147+448=595. Let's verify that by querying X$KFFXP:

SQL> select AU_KFFXP
from X$KFFXP
where GROUP_KFFXP=1  -- disk group 1
and NUMBER_KFFXP=259 -- file 259
;

  AU_KFFXP
----------
       590
       591
       592
       593
       594
       595

6 rows selected.

Free space

In the above kfed output, we see that kfdate[148] and kfdate[149] have the word free next to them, which marks them as free or unallocated allocation units (flagged with V=0). That kfed output is truncated, but there are many more free allocation units described by this AT block.

The stride

Each AT block can describe 448 AUs (the kfdatb.shrink value from the Allocation Table), and the whole AT can have 254 blocks (the kfdfsb.max value from the Free Space Table). This means that one Allocation Table can describe 254x448=113792 allocation units. This is called the stride, and the stride size - expressed in number of allocation units - is in the field kfdhdb.mfact, in ASM disk header:

$ kfed read /dev/sdc1 | grep kfdhdb.mfact
kfdhdb.mfact:                    113792 ; 0x0c0: 0x0001bc80

The stride size in this example is for an AU size of 1MB, which can fit 256 metadata blocks in AU0. Block 0 is for the disk header and block 1 is for the Free Space Table, which leaves 254 blocks for the Allocation Table.

With the AU size of 4MB (default in Exadata), the stride size will be 454272 allocation units or 1817088 MB. With the larger AU size, the stride will also be larger.
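
The stride arithmetic is easy to reproduce. Here is a small sketch for the 1 MB AU case; the values come straight from the kfed output shown above (kfdfsb.max and kfdatb.shrink):

$ AU_MB=1            # allocation unit size in MB (kfdhdb.ausize)
$ AT_BLOCKS=254      # Allocation Table blocks per stride (kfdfsb.max)
$ AUS_PER_BLOCK=448  # AUs described by one AT block (kfdatb.shrink)
$ echo "stride = $((AT_BLOCKS * AUS_PER_BLOCK)) AUs = $((AT_BLOCKS * AUS_PER_BLOCK * AU_MB)) MB"
stride = 113792 AUs = 113792 MB

That matches kfdhdb.mfact for this disk.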

How many Allocation Tables

Large ASM disks may have more than one stride. Each stride will have its own physically addressed metadata, which means that it will have its own Allocation Table.

The second stride will have its physically addressed metadata in the first AU of the stride. Let's have a look.

$ kfed read /dev/sdc1 | grep mfact
kfdhdb.mfact:                    113792 ; 0x0c0: 0x0001bc80

This shows the stride size is 113792 AUs. Let's check the AT entries for the second stride. Those should be in blocks 2-255 in AU113792.

$ kfed read /dev/sdc1 aun=113792 blkn=2 | grep type
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
...
$ kfed read /dev/sdc1 aun=113792 blkn=255 | grep type
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL

As expected, we have another AT in AU113792. If we had another stride, there would be another AT at the beginning of that stride. As it happens, I have a large disk with a few strides, so we see the AT at the beginning of the third stride as well:

$ kfed read /dev/sdc1 aun=227584 blkn=2 | grep type
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL

Conclusion

Every ASM disk contains at least one Allocation Table that describes the contents of the disk. The AT has one entry for every allocation unit on the disk. If the disk has more than one stride, each stride will have its own Allocation Table.

Partnership and Status Table

The Partnership and Status Table (PST) contains the information about all ASM disks in a disk group – disk number, disk status, partner disk number, heartbeat info and the failgroup info (11g and later).

Allocation unit number 1 on every ASM disk is reserved for the PST, but only some disks will have the PST data.

PST count

In an external redundancy disk group there will be only one copy of the PST.

In a normal redundancy disk group there will be at least two copies of the PST. If there are three or more failgroups, there will be three copies of the PST.

In a high redundancy disk group there will be at least three copies of the PST. If there are four failgroups, there will be four PST copies, and if there are five or more failgroups there will be five copies of the PST.

Let's have a look. Note that in each example, the disk group is created with five disks.

External redundancy disk group.

SQL> CREATE DISKGROUP DG1 EXTERNAL REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9';

Diskgroup created.

ASM alert log:

Sat Aug 31 20:44:59 2013
SQL> CREATE DISKGROUP DG1 EXTERNAL REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9'
Sat Aug 31 20:44:59 2013
NOTE: Assigning number (2,0) to disk (/dev/sdc5)
NOTE: Assigning number (2,1) to disk (/dev/sdc6)
NOTE: Assigning number (2,2) to disk (/dev/sdc7)
NOTE: Assigning number (2,3) to disk (/dev/sdc8)
NOTE: Assigning number (2,4) to disk (/dev/sdc9)
...
NOTE: initiating PST update: grp = 2
Sat Aug 31 20:45:00 2013
GMON updating group 2 at 50 for pid 22, osid 9873
NOTE: group DG1: initial PST location: disk 0000 (PST copy 0)
Sat Aug 31 20:45:00 2013
NOTE: PST update grp = 2 completed successfully
...

We see that ASM creates only one copy of the PST.

Normal redundancy disk group

SQL> drop diskgroup DG1;

Diskgroup dropped.

SQL> CREATE DISKGROUP DG1 NORMAL REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9';

Diskgroup created.

ASM alert log

Sat Aug 31 20:49:28 2013
SQL> CREATE DISKGROUP DG1 NORMAL REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9'
Sat Aug 31 20:49:28 2013
NOTE: Assigning number (2,0) to disk (/dev/sdc5)
NOTE: Assigning number (2,1) to disk (/dev/sdc6)
NOTE: Assigning number (2,2) to disk (/dev/sdc7)
NOTE: Assigning number (2,3) to disk (/dev/sdc8)
NOTE: Assigning number (2,4) to disk (/dev/sdc9)
...
Sat Aug 31 20:49:28 2013
NOTE: group 2 PST updated.
NOTE: initiating PST update: grp = 2
Sat Aug 31 20:49:28 2013
GMON updating group 2 at 68 for pid 22, osid 9873
NOTE: group DG1: initial PST location: disk 0000 (PST copy 0)
NOTE: group DG1: initial PST location: disk 0001 (PST copy 1)
NOTE: group DG1: initial PST location: disk 0002 (PST copy 2)
Sat Aug 31 20:49:28 2013
NOTE: PST update grp = 2 completed successfully
...

We see that ASM creates three copies of the PST.

High redundancy disk group

SQL> drop diskgroup DG1;

Diskgroup dropped.

SQL> CREATE DISKGROUP DG1 HIGH REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9';

Diskgroup created.

ASM alert log

Sat Aug 31 20:51:52 2013
SQL> CREATE DISKGROUP DG1 HIGH REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9'
Sat Aug 31 20:51:52 2013
NOTE: Assigning number (2,0) to disk (/dev/sdc5)
NOTE: Assigning number (2,1) to disk (/dev/sdc6)
NOTE: Assigning number (2,2) to disk (/dev/sdc7)
NOTE: Assigning number (2,3) to disk (/dev/sdc8)
NOTE: Assigning number (2,4) to disk (/dev/sdc9)
...
Sat Aug 31 20:51:53 2013
NOTE: group 2 PST updated.
NOTE: initiating PST update: grp = 2
Sat Aug 31 20:51:53 2013
GMON updating group 2 at 77 for pid 22, osid 9873
NOTE: group DG1: initial PST location: disk 0000 (PST copy 0)
NOTE: group DG1: initial PST location: disk 0001 (PST copy 1)
NOTE: group DG1: initial PST location: disk 0002 (PST copy 2)
NOTE: group DG1: initial PST location: disk 0003 (PST copy 3)
NOTE: group DG1: initial PST location: disk 0004 (PST copy 4)
Sat Aug 31 20:51:53 2013
NOTE: PST update grp = 2 completed successfully
...

We see that ASM creates five copies of the PST.
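
If you just want to know which disks currently hold a PST copy, the simplest check is to grep the ASM alert log for the PST location messages shown above. A minimal sketch, assuming you run it from the directory that holds the alert log:

$ grep "PST location" alert_+ASM.log | tail -5
NOTE: group DG1: initial PST location: disk 0000 (PST copy 0)
NOTE: group DG1: initial PST location: disk 0001 (PST copy 1)
NOTE: group DG1: initial PST location: disk 0002 (PST copy 2)
NOTE: group DG1: initial PST location: disk 0003 (PST copy 3)
NOTE: group DG1: initial PST location: disk 0004 (PST copy 4)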

PST relocation

The PST would be relocated in the following cases:
  • The disk with the PST is not available (on ASM startup)
  • The disk goes offline
  • There was an I/O error while reading/writing to/from the PST
  • Disk is dropped gracefully
In all cases the PST would be relocated to another disk in the same failgroup (if a disk is available in the same failure group) or to another failgroup (that doesn't already contain a copy of the PST).

Let's have a look.

SQL> drop diskgroup DG1;

Diskgroup dropped.

SQL> CREATE DISKGROUP DG1 NORMAL REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8';

Diskgroup created.

ASM alert log shows the PST copies are on disks 0, 1 and 2:

NOTE: group DG1: initial PST location: disk 0000 (PST copy 0)
NOTE: group DG1: initial PST location: disk 0001 (PST copy 1)
NOTE: group DG1: initial PST location: disk 0002 (PST copy 2)

Let's drop disk 0:

SQL> select disk_number, name, path from v$asm_disk_stat
where group_number = (select group_number from v$asm_diskgroup_stat where name='DG1');

DISK_NUMBER NAME                           PATH
----------- ------------------------------ ----------------
          3 DG1_0003                       /dev/sdc8
          2 DG1_0002                       /dev/sdc7
          1 DG1_0001                       /dev/sdc6
          0 DG1_0000                       /dev/sdc5

SQL> alter diskgroup DG1 drop disk DG1_0000;

Diskgroup altered.

ASM alert log

Sat Aug 31 21:04:29 2013
SQL> alter diskgroup DG1 drop disk DG1_0000
...
NOTE: initiating PST update: grp 2 (DG1), dsk = 0/0xe9687ff6, mask = 0x6a, op = clear
Sat Aug 31 21:04:37 2013
GMON updating disk modes for group 2 at 96 for pid 24, osid 16502
NOTE: group DG1: updated PST location: disk 0001 (PST copy 0)
NOTE: group DG1: updated PST location: disk 0002 (PST copy 1)
NOTE: group DG1: updated PST location: disk 0003 (PST copy 2)
...

We see that the PST copy from disk 0 was moved to disk 3.

Disk Partners

A disk partnership is a symmetric relationship between two disks in a high or normal redundancy disk group. There is no disk partnership in external redundancy disk groups. For a discussion on this topic, please see the post How many partners.

PST Availability

The PST has to be available before the rest of ASM metadata. When the disk group mount is requested, the GMON process (on the instance requesting a mount) reads all disks in the disk group to find and verify all available PST copies. Once it verifies that there are enough PSTs for a quorum, it mounts the disk group. From that point on, the PST is available in the ASM instance cache, stored in the GMON PGA and protected by an exclusive lock on the PT.n.0 enqueue.

As other ASM instances in the same cluster come online, they cache the PST in their GMON PGA with a shared PT.n.0 enqueue.

Only the GMON (the CKPT in 10gR1) that holds the exclusive lock on the PT enqueue can update the PST information on disk.

PST (GMON) tracing

The GMON trace file will log the PST info every time a disk group mount is attempted. Note that I said attempted, not mounted, as the GMON will log the information regardless of whether the mount succeeds. This information may be valuable to Oracle Support in diagnosing disk group mount failures.

This is typical information logged in the GMON trace file on a disk group mount:

=============== PST ====================
grpNum:    2
grpTyp:    2
state:     1
callCnt:   103
bforce:    0x0
(lockvalue) valid=1 ver=0.0 ndisks=3 flags=0x3 from inst=0 (I am 1) last=0
--------------- HDR --------------------
next:    7
last:    7
pst count:       3
pst locations:   1  2  3
incarn:          4
dta size:        4
version:         0
ASM version:     168820736 = 10.1.0.0.0
contenttype:     0
--------------- LOC MAP ----------------
0: dirty 0       cur_loc: 0      stable_loc: 0
1: dirty 0       cur_loc: 0      stable_loc: 0
--------------- DTA --------------------
1: sts v v(rw) p(rw) a(x) d(x) fg# = 0 addTs = 0 parts: 2 (amp) 3 (amp)
2: sts v v(rw) p(rw) a(x) d(x) fg# = 0 addTs = 0 parts: 1 (amp) 3 (amp)
3: sts v v(rw) p(rw) a(x) d(x) fg# = 0 addTs = 0 parts: 1 (amp) 2 (amp)
...

The section marked === PST === tells us the group number (grpNum), type (grpTyp) and state. The section marked --- HDR --- shows the number of PST copies (pst count) and the disk numbers that have those copies (pst locations). The section marked --- DTA --- shows the actual state of the disks with the PST.
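
To pull that section out of the most recent GMON trace without opening the whole file, something like this will do. It is only a sketch - it assumes the default diagnostic destination layout (the same diag directory structure used elsewhere in this post) and an ASM instance named +ASM:

$ cd $ORACLE_BASE/diag/asm/+asm/+ASM/trace
$ grep -A 12 "=============== PST" `ls -t +ASM_gmon_*.trc | head -1`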

Conclusion

The Partnership and Status Table contains the information about all ASM disks in a disk group – disk number, disk status, partner disk number, heartbeat info and the failgroup info (11g and later).

Allocation unit number 1 on every ASM disk is reserved for the PST, but only some disks will have the PST data. As the PST is valuable ASM metadata, it is mirrored three times in a normal redundancy disk group and five times in a high redundancy disk group - provided there are enough failgroups, of course.


ASM metadata blocks

An ASM instance manages metadata needed to make ASM files available to Oracle databases and other ASM clients. ASM metadata is stored in disk groups and organised in metadata structures. These metadata structures consist of one or more ASM metadata blocks. For example, the ASM disk header consists of a single ASM metadata block. Other structures, like the Partnership and Status Table, consist of exactly one allocation unit (AU). Some ASM metadata structures, like the File Directory, can span multiple AUs and do not have a predefined size; in fact, the File Directory will grow as needed and will be managed like any other ASM file.

ASM metadata block types

The following are the ASM metadata block types:
  • KFBTYP_DISKHEAD - ASM disk header block. This block will be the very first block in every ASM disk. A copy of this block will be in the second last PST block (11.1.0.7 and later). The copy of this block will also be in the very first block in AU11, for disk groups with COMPATIBLE.ASM=12.1 and later.
  • KFBTYP_FREESPC - Free Space Table block.
  • KFBTYP_ALLOCTBL - Allocation Table block.
  • KFBTYP_PST_META - Partnership and Status Table (PST) block. The PST blocks 0 and 1 will be of this type.
  • KFBTYP_PST_DTA - Partnership and Status Table (PST) block. The PST blocks with actual PST data.
  • KFBTYP_PST_NONE - Partnership and Status Table (PST) block. The PST block with no PST data. Remember that AU1 on every disk is reserved for the PST, but only some disks will have the PST data.
  • KFBTYP_HBEAT - The heartbeat block. This block is in the Partnership and Status Table (PST).
  • KFBTYP_FILEDIR - File Directory block.
  • KFBTYP_INDIRECT - File Directory block, containing a pointer to another file directory block.
  • KFBTYP_LISTHEAD - Disk Directory block. The very first block in the ASM disk directory. The field kfdhdb.f1b1locn in the ASM disk header will point to the allocation unit whose block 0 will be of this type.
  • KFBTYP_DISKDIR - Disk Directory block. The rest of the blocks in the ASM disk directory.
  • KFBTYP_ACDC - Active Change Directory (ACD) block. The very first block of the ACD will be of this type.
  • KFBTYP_CHNGDIR - Active Change Directory (ACD) block. The blocks with the ACD data.
  • KFBTYP_COD_BGO - Continuing Operations Directory (COD) block for background operations data.
  • KFBTYP_COD_RBO - Continuing Operations Directory (COD) block for rollback operations data.
  • KFBTYP_COD_DATA - Continuing Operations Directory (COD) block with the actual rollback operations data.
  • KFBTYP_TMPLTDIR - Template Directory block.
  • KFBTYP_ALIASDIR - Alias Directory block.
  • KFBTYP_SR - Staleness Registry block.
  • KFBTYP_STALEDIR - Staleness Directory block.
  • KFBTYP_VOLUMEDIR - ADVM Volume Directory block.
  • KFBTYP_ATTRDIR - Attributes Directory block.
  • KFBTYP_USERDIR - User Directory block.
  • KFBTYP_GROUPDIR - User Group Directory block.
  • KFBTYP_USEDSPC - Disk Used Space Directory block.
  • KFBTYP_ASMSPFALS - ASM spfile alias block.
  • KFBTYP_PASWDDIR - Password Directory block.
  • KFBTYP_INVALID - Not an ASM metadata block.
Note that the KFBTYP_INVALID is not an actual block type stored in ASM metadata block. Instead, ASM will return this if it encounters a block where the type is not one of the valid ASM metadata block types. For example if the ASM disk header is corrupt, say zeroed out, ASM will report it as KFBTYP_INVALID. We will also see the same when reading such block with the kfed tool.
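
This makes kfed handy for a quick disk header health check. A minimal sketch; the aun=1 blkn=254 read assumes a 1 MB allocation unit, where the disk header copy sits in the second last PST block (11.1.0.7 and later), as shown earlier in this post:

$ kfed read /dev/sdc1 | grep kfbh.type
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD

$ kfed read /dev/sdc1 aun=1 blkn=254 | grep kfbh.type
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD

If the first read comes back as KFBTYP_INVALID while the copy is still intact, the header itself is damaged and the copy can be used to restore it.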

ASM metadata block structure

The default ASM metadata block size is 4096 bytes. The block size will be specified in the ASM disk header field kfdhdb.blksize. Note that the ASM metadata block size has nothing to do with the database block size.

The first 32 bytes of every ASM metadata block contains the block header (not to be confused with the ASM disk header). The block header has the following information:
  • kfbh.endian - Platform endianness.
  • kfbh.hard - H.A.R.D. (Hardware Assisted Resilient Data) signature.
  • kfbh.type - Block type.
  • kfbh.datfmt - Block data format.
  • kfbh.block.blk - Location (block number) of this block.
  • kfbh.block.obj - Data type held in this block.
  • kfbh.check - Block checksum.
  • kfbh.fcn.base - Block change control number (base).
  • kfbh.fcn.wrap - Block change control number (wrap).
The FCN is the ASM equivalent of database SCN.

The rest of the contents of an ASM metadata block will be type specific. In other words, an ASM disk header block will have the disk header specific data, like disk number, disk name, disk group name etc. A file directory block will have the extent location data for a file, etc.
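
To look at just that common block header for any metadata block, grep for the kfbh fields. A quick sketch, reading the first Allocation Table block from the earlier examples:

$ kfed read /dev/sdc1 blkn=2 | grep "^kfbh"
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
...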

Conclusion

An ASM instance manages ASM metadata blocks. It creates them, updates them, calculates and updates the checksums on writes, reads and verifies the checksums on reads, exchanges the blocks with other instances, and so on. ASM metadata structures consist of one or more ASM metadata blocks. A tool like kfed can be used to read ASM metadata blocks from ASM disks.

Tell me about your ASM

When diagnosing ASM issues, it helps to know a bit about the setup - disk group names and types, the state of disks, ASM instance initialisation parameters and whether any rebalance operations are in progress. In those cases I usually ask for an HTML report, which is produced by running my SQL script against one of the ASM instances. This post is about that script, with comments about the output.

The script

First, here is the script, let's say saved as asm_report.sql:

spool /tmp/ASM_report.html
set markup html on
set echo off
set feedback off
set pages 10000
break on INST_ID on GROUP_NUMBER
prompt ASM report
select to_char(SYSDATE, 'DD-Mon-YYYY HH24:MI:SS') "Time" from dual;
prompt Version
select * from V$VERSION where BANNER like '%Database%' order by 1;
prompt Cluster wide operations
select * from GV$ASM_OPERATION order by 1;
prompt
prompt Disk groups, including the dismounted disk groups
select * from V$ASM_DISKGROUP order by 1, 2, 3;
prompt All disks, including the candidate disks
select GROUP_NUMBER, DISK_NUMBER, FAILGROUP, NAME, LABEL, PATH, MOUNT_STATUS, HEADER_STATUS, STATE, OS_MB, TOTAL_MB, FREE_MB, CREATE_DATE, MOUNT_DATE, SECTOR_SIZE, VOTING_FILE, FAILGROUP_TYPE
from V$ASM_DISK
where MODE_STATUS='ONLINE'
order by 1, 2;
prompt Offline disks
select GROUP_NUMBER, DISK_NUMBER, FAILGROUP, NAME, MOUNT_STATUS, HEADER_STATUS, STATE, REPAIR_TIMER
from V$ASM_DISK
where MODE_STATUS='OFFLINE'
order by 1, 2;
prompt Disk group attributes
select GROUP_NUMBER, NAME, VALUE from V$ASM_ATTRIBUTE where NAME not like 'template%' order by 1;
prompt Connected clients
select * from V$ASM_CLIENT order by 1, 2;
prompt ASM specific initialisation parameters, including the hidden ones
select KSPPINM "Parameter", KSPFTCTXVL "Value", kspftctxdf "Default"
from X$KSPPI a, X$KSPPCV2 b
where a.INDX + 1 = KSPFTCTXPN and (KSPPINM like '%asm%' or KSPPINM like '%balance%' or KSPPINM like '%auto_manage%')
order by 1 desc;
prompt Memory, cluster and instance specific initialisation parameters
select NAME "Parameter", VALUE "Value", ISDEFAULT "Default"
from V$PARAMETER
where NAME like '%target%' or NAME like '%pool%' or NAME like 'cluster%' or NAME like 'instance%'
order by 1;
prompt Disk group imbalance
select g.NAME "Diskgroup",
100*(max((d.TOTAL_MB-d.FREE_MB + (128*g.ALLOCATION_UNIT_SIZE/1048576))/(d.TOTAL_MB + (128*g.ALLOCATION_UNIT_SIZE/1048576)))-min((d.TOTAL_MB-d.FREE_MB + (128*g.ALLOCATION_UNIT_SIZE/1048576))/(d.TOTAL_MB + (128*g.ALLOCATION_UNIT_SIZE/1048576))))/max((d.TOTAL_MB-d.FREE_MB + (128*g.ALLOCATION_UNIT_SIZE/1048576))/(d.TOTAL_MB + (128*g.ALLOCATION_UNIT_SIZE/1048576))) "Imbalance",
count(*) "Disk count",
g.TYPE "Type"
from V$ASM_DISK_STAT d , V$ASM_DISKGROUP_STAT g
where d.GROUP_NUMBER = g.GROUP_NUMBER and d.STATE = 'NORMAL' and d.MOUNT_STATUS = 'CACHED'
group by g.NAME, g.TYPE;
prompt End of ASM report
set markup html off
set echo on
set feedback on
exit

To produce the report, which will be saved as /tmp/ASM_report.html (as per the spool command), run the script as the OS user that owns the Grid Infrastructure home (usually grid or oracle), against an ASM instance (say +ASM1), like this:

$ sqlplus -S / as sysasm @asm_report.sql
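
If you run this regularly, a small wrapper can keep a timestamped copy of each report. This is just a sketch - it assumes the grid environment (ORACLE_HOME, PATH) is already set for the local ASM instance:

#!/bin/bash
# Run the ASM report and keep a timestamped copy of the output
export ORACLE_SID=+ASM1
sqlplus -S / as sysasm @asm_report.sql
mv /tmp/ASM_report.html /tmp/ASM_report_$(date +%Y%m%d_%H%M%S).html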

The report

The report first shows the time of the report and the ASM version.

It then shows if there are any ASM operations in progress - for example, a rebalance running in ASM instance 1, or a completed resync and rebalance with only the compacting phase still outstanding.


Next we see the information about all disk groups, including the dismounted disk groups. This is then followed by the info about disks, again with the note that this includes the candidate disks.

I have separated the info about offline disks, as this may be of interest when dealing with disk issues.



Next are the disk group attributes, with the note that this will be displayed only for ASM version 11.1 and later, as we did not have the disk group attributes in earlier versions.

This is followed by the list of connected clients, usually database instances served by that ASM instance.

The section with ASM initialisation parameters includes hidden and some Exadata specific (_auto_manage) parameters.

I have also separated the memory, cluster and instance specific initialisation parameters as they are often of special interest.

The last section shows the disk group imbalance report.

Sample reports

Here is a sample report from an Exadata system: ASM_report_Exa.html.

And here is a sample report from a version 12c Oracle Restart system: ASM_report_12c.html.

Conclusion

While I use this report for a quick overview of ASM, it can also serve as a 'backup' of your ASM setup information. You are welcome to modify the script to produce a report that suits your needs. Please let me know if you find any issues with the script or if you have suggestions for improvements.

The ASM password directory

Password file authentication for Oracle Database or ASM can work for both local and remote connections. In Oracle version 12c, the password files can reside in an ASM disk group. The ASM metadata structure for managing the passwords is the ASM Password Directory - ASM metadata file number 13.

Note that the password files are accessible only after the disk group is mounted. One implication of this is that no remote SYSASM access to ASM and no remote SYSDBA access to database is possible, until the disk group with the password file is mounted.

The password file

The disk group based password file is managed by ASMCMD commands, ORAPWD tool and SRVCTL commands. The password file can be created with ORAPWD and ASMCA (at the time ASM is configured). All other password file manipulations are performed with ASMCMD or SRVCTL commands.

The COMPATIBLE.ASM disk group attribute must be set to at least 12.1 for the disk group where the password is to be located. The SYSASM privilege is required to manage the ASM password file and the SYSDBA privilege is required to manage the database password file.

Let's create the ASM password file in disk group DATA.

First make sure the COMPATIBLE.ASM attribute is set to the minimum required value:

$ asmcmd lsattr -G DATA -l compatible.asm
Name            Value
compatible.asm  12.1.0.0.0

Create the ASM password file:

$ orapwd file='+DATA/orapwasm' asm=y
Enter password for SYS: *******
$

Get the ASM password file name:

$ asmcmd pwget --asm
+DATA/orapwasm

And finally, find the ASM password file location and fully qualified name:

$ asmcmd find +DATA "*" --type password
+DATA/ASM/PASSWORD/pwdasm.256.837972683
+DATA/orapwasm

From this we see that +DATA/orapwasm is an alias for the actual file, which lives under the special +[DISK GROUP NAME]/ASM/PASSWORD location.

The ASM password directory

The ASM metadata structure for managing the disk group based passwords is the ASM Password Directory - ASM metadata file number 13. Note that the password file is also managed by the ASM File Directory, like any other ASM based file. I guess this redundancy just highlights the importance of the password file.

Let's locate the ASM Password Directory. As that is file number 13, we can look it up in the ASM File Directory. That means we first need to locate the ASM File Directory itself. The pointer to the first AU of the ASM File Directory is in the disk header of disk 0, in the field kfdhdb.f1b1locn:

First match the disk numbers to disk paths for disk group DATA:

$ asmcmd lsdsk -p -G DATA | cut -c12-21,78-88
Disk_Num  Path
      0  /dev/sdc1
      1  /dev/sdd1
      2  /dev/sde1
      3  /dev/sdf1

Now get the starting point of the file directory:

$ kfed read /dev/sdc1 | grep f1b1locn
kfdhdb.f1b1locn:                     10 ; 0x0d4: 0x0000000a

This is telling us that the file directory starts at AU 10 on that disk. Now look up block 13 in AU 10 - that will be the directory entry for ASM file 13, i.e. the ASM Password Directory.

$ kfed read /dev/sdc1 aun=10 blkn=13 | egrep "au|disk" | head
kfffde[0].xptr.au:                   47 ; 0x4a0: 0x0000002f
kfffde[0].xptr.disk:                  2 ; 0x4a4: 0x0002
kfffde[1].xptr.au:                   45 ; 0x4a8: 0x0000002d
kfffde[1].xptr.disk:                  1 ; 0x4ac: 0x0001
kfffde[2].xptr.au:                   46 ; 0x4b0: 0x0000002e
kfffde[2].xptr.disk:                  3 ; 0x4b4: 0x0003
kfffde[3].xptr.au:           4294967295 ; 0x4b8: 0xffffffff
kfffde[3].xptr.disk:              65535 ; 0x4bc: 0xffff
...

The output is telling us that the ASM Password Directory is in AU 47 on disk 2 (with copies in AU 45 on disk 1, and AU 46 on disk 3). Note that the ASM Password Directory is triple mirrored, even in a normal redundancy disk group.

Now look at AU 47 on disk 2:

$ kfed read /dev/sde1 aun=47 blkn=1 | more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                           29 ; 0x002: KFBTYP_PASWDDIR
...
kfzpdb.block.incarn:                  3 ; 0x000: A=1 NUMM=0x1
kfzpdb.block.frlist.number:  4294967295 ; 0x004: 0xffffffff
kfzpdb.block.frlist.incarn:           0 ; 0x008: A=0 NUMM=0x0
kfzpdb.next.number:                  15 ; 0x00c: 0x0000000f
kfzpdb.next.incarn:                   3 ; 0x010: A=1 NUMM=0x1
kfzpdb.flags:                         0 ; 0x014: 0x00000000
kfzpdb.file:                        256 ; 0x018: 0x00000100
kfzpdb.finc:                  837972683 ; 0x01c: 0x31f272cb
kfzpdb.bcount:                       15 ; 0x020: 0x0000000f
kfzpdb.size:                        512 ; 0x024: 0x00000200
...

The ASM metadata block type is KFBTYP_PASWDDIR, i.e. the ASM Password Directory. We see that it points to ASM file number 256 (kfzpdb.file=256). From the earlier asmcmd find command we already know the actual password file is ASM file 256 in disk group DATA:

$ asmcmd ls -l +DATA/ASM/PASSWORD
Type      Redund  Striped  Time             Sys  Name
PASSWORD  HIGH    COARSE   JAN 27 18:00:00  Y    pwdasm.256.837972683

Again, note that the ASM password file is triple mirrored (Redund=HIGH) even in a normal redundancy disk group.
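
We can also verify all three copies of the Password Directory directly with kfed, using the AU/disk mapping from the file directory lookup above (disk 2 = /dev/sde1, disk 1 = /dev/sdd1, disk 3 = /dev/sdf1). Each read should come back as KFBTYP_PASWDDIR:

$ kfed read /dev/sde1 aun=47 blkn=1 | grep kfbh.type
$ kfed read /dev/sdd1 aun=45 blkn=1 | grep kfbh.type
$ kfed read /dev/sdf1 aun=46 blkn=1 | grep kfbh.type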

Conclusion

Starting with ASM version 12c we can store ASM and database password files in an ASM disk group. The ASM password can be created at the time of Grid Infrastructure installation or later with the ORAPWD command. The disk group based password files are managed with ASMCMD, ORAPWD and SRVCTL commands.

ACFS disk group rebalance

Starting with Oracle Database version 11.2, an ASM disk group can be used for hosting one or more cluster file systems. These are known as Oracle ASM Cluster File Systems or Oracle ACFS. This functionality is achieved by creating special volume files inside the ASM disk group, which are then exposed to the OS as block devices. The file systems are then created on those block devices.

This post is about the rebalancing, mirroring and extent management of the ACFS volume files.

The environment used for the examples:
* 64-bit Oracle Linux 5.4, in Oracle Virtual Box
* Oracle Restart and ASM version 11.2.0.3.0 - 64bit
* ASMLib/oracleasm version 2.1.7

Set up ACFS volumes

As this is an Oracle Restart environment (single instance), I have to load ADVM/ACFS drivers manually (as root user).

# acfsload start
ACFS-9391: Checking for existing ADVM/ACFS installation.
ACFS-9392: Validating ADVM/ACFS installation files for operating system.
ACFS-9393: Verifying ASM Administrator setup.
ACFS-9308: Loading installed ADVM/ACFS drivers.
ACFS-9154: Loading 'oracleoks.ko' driver.
ACFS-9154: Loading 'oracleadvm.ko' driver.
ACFS-9154: Loading 'oracleacfs.ko' driver.
ACFS-9327: Verifying ADVM/ACFS devices.
ACFS-9156: Detecting control device '/dev/asm/.asm_ctl_spec'.
ACFS-9156: Detecting control device '/dev/ofsctl'.
ACFS-9322: completed
#

Create a disk group to hold ASM cluster file systems.

$ sqlplus / as sysasm

SQL> create diskgroup ACFS
disk 'ORCL:ASMDISK5', 'ORCL:ASMDISK6'
attribute 'COMPATIBLE.ASM' = '11.2', 'COMPATIBLE.ADVM' = '11.2';

Diskgroup created.

SQL>

While it is possible and supported to have a disk group that holds both database files and ACFS volume files, I recommend having a separate disk group for ACFS volumes. This provides role/function separation and potential performance benefits for the database files.

Check the allocation unit (AU) sizes for all disk groups.

SQL> select group_number "Group#", name "Name", allocation_unit_size "AU size"
from v$asm_diskgroup_stat;

   Group# Name        AU size
---------- -------- ----------
        1 ACFS        1048576
        2 DATA        1048576

SQL>

Note the default AU size (1MB) for both disk groups. I will refer to this later on, when I talk about the extent sizes for the volume files.

Create some volumes in disk group ACFS.

$ asmcmd volcreate -G ACFS -s 4G VOL1
$ asmcmd volcreate -G ACFS -s 2G VOL2
$ asmcmd volcreate -G ACFS -s 1G VOL3

Get the volume info.

$ asmcmd volinfo -a
Diskgroup Name: ACFS

Volume Name: VOL1
Volume Device: /dev/asm/vol1-142
State: ENABLED
Size (MB): 4096
Resize Unit (MB): 32
Redundancy: MIRROR
Stripe Columns: 4
Stripe Width (K): 128
Usage:
Mountpath:

Volume Name: VOL2
Volume Device: /dev/asm/vol2-142
State: ENABLED
Size (MB): 2048
Resize Unit (MB): 32
Redundancy: MIRROR
Stripe Columns: 4
Stripe Width (K): 128
Usage:
Mountpath:

Volume Name: VOL3
Volume Device: /dev/asm/vol3-142
State: ENABLED
Size (MB): 1024
Resize Unit (MB): 32
Redundancy: MIRROR
Stripe Columns: 4
Stripe Width (K): 128
Usage:
Mountpath:

$

Note that the volumes are automatically enabled after creation. On (server) restart we would need to manually load ADVM/ACFS drivers (acfsload start) and enable the volumes (asmcmd volenable -a).
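
For reference, this is roughly what that looks like after a reboot of this Oracle Restart box - a sketch only, with the driver load and the mount done as root and the volume enable as the grid owner (the mount point and the volume device name are the ones used later in this post):

# acfsload start
...
$ asmcmd volenable -a
# mount -t acfs /dev/asm/vol1-142 /acfs1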

ASM files for ACFS support

For each volume, the ASM creates a volume file. In a redundant disk group, each volume will also have a dirty region logging (DRL) file.

Get some info about our volume files.

SQL> select file_number "File#", volume_name "Volume", volume_device "Device", size_mb "MB", drl_file_number "DRL#"
from v$asm_volume;

File# Volume Device               MB DRL#
------ ------ ----------------- ----- ----
  256 VOL1   /dev/asm/vol1-142  4096  257
  259 VOL2   /dev/asm/vol2-142  2048  258
  261 VOL3   /dev/asm/vol3-142  1024  260

SQL>

In addition to volume names, device names and sizes, this shows ASM files numbers 256, 259 and 261 for volume devices, and ASM file numbers 257, 258 and 260 for the associated DRL files.

Volume file extents

Get the extent distribution info for one of the volume files.

SQL> select xnum_kffxp "Extent", au_kffxp "AU", disk_kffxp "Disk"
from x$kffxp
where group_kffxp=2 and number_kffxp=261
order by 1,2;

  Extent         AU       Disk
---------- ---------- ----------
        0       6256          0
        0       6256          1
        1       6264          0
        1       6264          1
        2       6272          1
        2       6272          0
        3       6280          0
        3       6280          1
...
      127       7272          0
      127       7272          1
2147483648       6252          0
2147483648       6252          1
2147483648 4294967294      65534

259 rows selected.

SQL>

First thing to note is that each extent is mirrored, as the volume is in a normal redundancy disk group.

We also see that the volume file 261 has 128 extents. As the volume size is 1GB, that means each extent size is 8MB or 8 AUs. The point here is that the volume files have their own extent size, unlike the standard ASM files that inherit the (initial) extent size from the disk group AU size.

ASM based cluster file systems

We can now use the volumes to create ASM cluster file systems and let everyone use them (this needs to be done as root user, of course):

# mkdir /acfs1
# mkdir /acfs2
# mkdir /acfs3

# chmod 777 /acfs?

# /sbin/mkfs -t acfs /dev/asm/vol1-142
mkfs.acfs: version         = 11.2.0.3.0
mkfs.acfs: on-disk version = 39.0
mkfs.acfs: volume          = /dev/asm/vol1-142
mkfs.acfs: volume size     = 4294967296
mkfs.acfs: Format complete.

# /sbin/mkfs -t acfs /dev/asm/vol2-142
mkfs.acfs: version         = 11.2.0.3.0
mkfs.acfs: on-disk version = 39.0
mkfs.acfs: volume          = /dev/asm/vol2-142
mkfs.acfs: volume size     = 2147483648
mkfs.acfs: Format complete.

# /sbin/mkfs -t acfs /dev/asm/vol3-142
mkfs.acfs: version         = 11.2.0.3.0
mkfs.acfs: on-disk version = 39.0
mkfs.acfs: volume          = /dev/asm/vol3-142
mkfs.acfs: volume size     = 1073741824
mkfs.acfs: Format complete.

# mount -t acfs /dev/asm/vol1-142 /acfs1
# mount -t acfs /dev/asm/vol2-142 /acfs2
# mount -t acfs /dev/asm/vol3-142 /acfs3

# mount | grep acfs
/dev/asm/vol1-142 on /acfs1 type acfs (rw)
/dev/asm/vol2-142 on /acfs2 type acfs (rw)
/dev/asm/vol3-142 on /acfs3 type acfs (rw)

Copy some files into the new file systems.

$ cp diag/asm/+asm/+ASM/trace/* /acfs1
$ cp diag/rdbms/db/DB/trace/* /acfs1
$ cp oradata/DB/datafile/* /acfs1

$ cp diag/asm/+asm/+ASM/trace/* /acfs2
$ cp oradata/DB/datafile/* /acfs2

$ cp fra/DB/backupset/* /acfs3

Check the used space.

$ df -h /acfs?
Filesystem         Size  Used Avail Use% Mounted on
/dev/asm/vol1-142  4.0G  1.3G  2.8G  31% /acfs1
/dev/asm/vol2-142  2.0G  1.3G  797M  62% /acfs2
/dev/asm/vol3-142  1.0G  577M  448M  57% /acfs3

ACFS disk group rebalance

Let's add one disk to the ACFS disk group and monitor the rebalance operation.

SQL> alter diskgroup ACFS add disk 'ORCL:ASMDISK4';

Diskgroup altered.

SQL>

Get the ARB0 PID from the ASM alert log.

$ tail alert_+ASM.log
Sat Feb 15 12:44:53 2014
SQL> alter diskgroup ACFS add disk 'ORCL:ASMDISK4'
NOTE: Assigning number (2,2) to disk (ORCL:ASMDISK4)
...
NOTE: starting rebalance of group 2/0x80486fe8 (ACFS) at power 1
SUCCESS: alter diskgroup ACFS add disk 'ORCL:ASMDISK4'
Starting background process ARB0
Sat Feb 15 12:45:00 2014
ARB0 started with pid=27, OS id=10767
...

And monitor the rebalance by tailing the ARB0 trace file.

$ tail -f ./+ASM_arb0_10767.trc

*** ACTION NAME:() 2014-02-15 12:45:00.151
ARB0 relocating file +ACFS.1.1 (2 entries)
ARB0 relocating file +ACFS.2.1 (1 entries)
ARB0 relocating file +ACFS.3.1 (42 entries)
ARB0 relocating file +ACFS.3.1 (1 entries)
ARB0 relocating file +ACFS.4.1 (2 entries)
ARB0 relocating file +ACFS.5.1 (1 entries)
ARB0 relocating file +ACFS.6.1 (1 entries)
ARB0 relocating file +ACFS.7.1 (1 entries)
ARB0 relocating file +ACFS.8.1 (1 entries)
ARB0 relocating file +ACFS.9.1 (1 entries)
ARB0 relocating file +ACFS.256.839587727 (120 entries)

*** 2014-02-15 12:46:58.905
ARB0 relocating file +ACFS.256.839587727 (117 entries)
ARB0 relocating file +ACFS.256.839587727 (1 entries)
ARB0 relocating file +ACFS.257.839587727 (17 entries)
ARB0 relocating file +ACFS.258.839590377 (17 entries)

*** 2014-02-15 12:47:50.744
ARB0 relocating file +ACFS.259.839590377 (119 entries)
ARB0 relocating file +ACFS.259.839590377 (1 entries)
ARB0 relocating file +ACFS.260.839590389 (17 entries)
ARB0 relocating file +ACFS.261.839590389 (60 entries)
ARB0 relocating file +ACFS.261.839590389 (1 entries)
...

We see that the rebalance is per ASM file. This is exactly the same behaviour as with database files - ASM performs the rebalance on a per file basis. The ASM metadata files (1-9) get rebalanced first. The ASM then rebalances the volume file 256, DRL file 257, and so on.

From this we see that the ASM rebalances volume files (and other ASM files), not the OS files in the associated file system(s).

Disk online operation in an ACFS disk group

When an ASM disk goes offline, the ASM creates the staleness registry and the staleness directory, to track the extents that should be modified on the offline disk. Once the disk comes back online, the ASM uses that information to perform the fast mirror resync.

That functionality is not available to volume files in ASM version 11.2. Instead, to online the disk, the ASM rebuilds the entire content of that disk. This is why the disk online performance for disk groups with volume files is inferior to that of disk groups with only standard database files.

The fast mirror resync functionality for volume files is available in ASM version 12.1 and later.

Conclusion

ASM disk groups can be used to host general purpose cluster file systems. ASM does this by creating volume files inside the disk groups, which are exposed to the operating system as block devices.

Existing ASM disk group mirroring functionality (normal and high redundancy) can be used to protect the user files at the file system level. ASM does this by mirroring extents for the volume files, in the same fashion it does this for any other ASM file. The volume files have their own extent sizes, unlike the standard database files that inherit the (initial) extent size from the disk group AU size.

The rebalance operation, in an ASM disk group that hosts ASM cluster file system volumes, is per volume file, not per the individual user files stored in the associated file system(s).

ASM file number 3

When ASM needs to make an atomic change to multiple metadata blocks, a log record is written into the Active Change Directory (ACD), which is ASM metadata file number 3. These log records are written in a single I/O.

ACD is divided into chunks (threads) and each running ASM instance has its own 42 MB chunk. When a disk group is created, a single chunk is allocated for the ACD. As more instances mount that disk group, the ACD grows (by 42 MB) to accommodate every running instance with its own ACD chunk.

The ACD components are:
  • ACDC - ACD checkpoint
  • ABA - ACD block address
  • LGE - ACD redo log record
  • BCD - ACD block change descriptor
Locating the active change directory

We can query X$KFFXP to find the ACD allocation units. This is ASM file number 3 hence number_kffxp=3 in our query:

SQL> SELECT x.xnum_kffxp "Extent",
x.au_kffxp "AU",
x.disk_kffxp "Disk #",
d.name "Disk name"
FROM x$kffxp x, v$asm_disk_stat d
WHERE x.group_kffxp=d.group_number
and x.disk_kffxp=d.disk_number
and x.group_kffxp=1
and x.number_kffxp=3
ORDER BY 1, 2;

    Extent         AU     Disk # Disk name
---------- ---------- ---------- ------------------------------
         0          4          0 ASMDISK5
         1          2          1 ASMDISK6
         2          5          0 ASMDISK5
...
        39         21          1 ASMDISK6
        40         24          0 ASMDISK5
        41         22          1 ASMDISK6

42 rows selected.

The query returned 42 rows, i.e. 42 allocation units. As the allocation unit size for this disk group is 1MB, that means the total size of the ACD is 42 MB.
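
The same check can be done in one go, by letting the ASM instance do the arithmetic. A rough sketch - it assumes each ACD extent is exactly one allocation unit, which is the case here:

$ sqlplus -S / as sysasm <<'EOF'
select count(distinct x.xnum_kffxp) * g.allocation_unit_size/1048576 "ACD MB"
from x$kffxp x, v$asm_diskgroup_stat g
where x.group_kffxp = g.group_number
and x.group_kffxp = 1
and x.number_kffxp = 3
group by g.allocation_unit_size;
EOF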

If I recreate the disk group with the larger allocation unit size, say 4 MB, we should still end up with a 42 MB ACD. Let's have a look:

SQL> create diskgroup RECO external redundancy
disk 'ORCL:ASMDISK5', 'ORCL:ASMDISK6'
attribute 'au_size'='4M';

Diskgroup created.

And now the query from x$kffxp and v$asm_disk_stat returns 11 rows - with 4 MB allocation units, 11 AUs are enough to hold the 42 MB ACD:

    Extent         AU     Disk # Disk name
---------- ---------- ---------- ------------------------------
         0          3          1 ASMDISK6
         1          3          0 ASMDISK5
         2          4          1 ASMDISK6
...
        10          8          1 ASMDISK6

11 rows selected.

Closer look at the ACD

Let's look at the ACD using the kfed utility. The last query shows that the ACD starts at AU 3 on disk ASMDISK6. Note that with the allocation unit size of 4 MB, I have to specify ausz=4m on the kfed command line:

$ kfed read /dev/oracleasm/disks/ASMDISK6 ausz=4m aun=3 | more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            7 ; 0x002: KFBTYP_ACDC
...
kfracdc.eyec[0]:                     65 ; 0x000: 0x41
kfracdc.eyec[1]:                     67 ; 0x001: 0x43
kfracdc.eyec[2]:                     68 ; 0x002: 0x44
kfracdc.eyec[3]:                     67 ; 0x003: 0x43
kfracdc.thread:                       1 ; 0x004: 0x00000001
kfracdc.lastAba.seq:         4294967295 ; 0x008: 0xffffffff
kfracdc.lastAba.blk:         4294967295 ; 0x00c: 0xffffffff
kfracdc.blk0:                         1 ; 0x010: 0x00000001
kfracdc.blks:                     11263 ; 0x014: 0x00002bff
kfracdc.ckpt.seq:                     2 ; 0x018: 0x00000002
kfracdc.ckpt.blk:                     2 ; 0x01c: 0x00000002
kfracdc.fcn.base:                    16 ; 0x020: 0x00000010
kfracdc.fcn.wrap:                     0 ; 0x024: 0x00000000
kfracdc.bufBlks:                    512 ; 0x028: 0x00000200
kfracdc.strt112.seq:                  0 ; 0x02c: 0x00000000
kfracdc.strt112.blk:                  0 ; 0x030: 0x00000000

The output shows that this is indeed an ACD block (kfbh.type=KFBTYP_ACDC). The only interesting piece of information here is kfracdc.thread=1, which means that this ACD chunk belongs to ASM instance 1. In a cluster, this would match the ASM instance number.

That was block 0, the beginning of the ACD. Let's now look at block 1 - the actual ACD data.

$ kfed read /dev/oracleasm/disks/ASMDISK6 ausz=4m aun=3 blkn=1 | more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            8 ; 0x002: KFBTYP_CHNGDIR
...
kfracdb.lge[0].valid:                 1 ; 0x00c: V=1 B=0 M=0
kfracdb.lge[0].chgCount:              1 ; 0x00d: 0x01
kfracdb.lge[0].len:                  52 ; 0x00e: 0x0034
kfracdb.lge[0].kfcn.base:            13 ; 0x010: 0x0000000d
kfracdb.lge[0].kfcn.wrap:             0 ; 0x014: 0x00000000
kfracdb.lge[0].bcd[0].kfbl.blk:       0 ; 0x018: blk=0
kfracdb.lge[0].bcd[0].kfbl.obj:       4 ; 0x01c: file=4
kfracdb.lge[0].bcd[0].kfcn.base:      0 ; 0x020: 0x00000000
kfracdb.lge[0].bcd[0].kfcn.wrap:      0 ; 0x024: 0x00000000
kfracdb.lge[0].bcd[0].oplen:          4 ; 0x028: 0x0004
kfracdb.lge[0].bcd[0].blkIndex:       0 ; 0x02a: 0x0000
kfracdb.lge[0].bcd[0].flags:         28 ; 0x02c: F=0 N=0 F=1 L=1 V=1 A=0 C=0
kfracdb.lge[0].bcd[0].opcode:       212 ; 0x02e: 0x00d4
kfracdb.lge[0].bcd[0].kfbtyp:         9 ; 0x030: KFBTYP_COD_BGO
kfracdb.lge[0].bcd[0].redund:        17 ; 0x031: SCHE=0x1 NUMB=0x1
kfracdb.lge[0].bcd[0].pad:        63903 ; 0x032: 0xf99f
kfracdb.lge[0].bcd[0].KFRCOD_CRASH:   1 ; 0x034: 0x00000001
kfracdb.lge[0].bcd[0].au[0]:          8 ; 0x038: 0x00000008
kfracdb.lge[0].bcd[0].disks[0]:       0 ; 0x03c: 0x0000
...

We see that the ACD block 1 is of type KFBTYP_CHNGDIR, and contains the elements of kfracdb.lge[i] structure - the ASM redo records. Some of the things of interest there are the operation being performed (opcode) and the operation type (kfbtyp). None of this is very useful outside of the ACD context, so we will leave it at that.

Conclusion

This is an informational post only, to complete the ASM metadata story, as there is no practical benefit in understanding the inner workings of the ACD.


ASM spfile in a disk group

Starting with ASM version 11.2, the ASM spfile can be stored in an ASM disk group. Indeed, during a new ASM installation, the Oracle Universal Installer (OUI) will place the ASM spfile in the disk group that gets created during the installation. This is true for both Oracle Restart (single instance environments) and Cluster installations.

The ASM spfile stored in the disk group is a registry file, and will always be the ASM metadata file number 253.

New ASMCMD commands

To support this feature, new ASMCMD commands were introduced to back up, copy and move the ASM spfile. The commands are:
  • spbackup - backs up an ASM spfile to a backup file. The backup file is not a special file type and is not identified as an spfile.
  • spcopy - copies an ASM spfile from the source location to an spfile in the destination location.
  • spmove - moves an ASM spfile from source to destination and automatically updates the GPnP profile.

The SQL commands CREATE PFILE FROM SPFILE and CREATE SPFILE FROM PFILE are still valid for the ASM spfile stored in the disk group.

ASM spfile in disk group DATA

In my environment, the ASM spfile is (somewhere) in disk group DATA. Let's find it:

$ asmcmd find --type ASMPARAMETERFILE +DATA "*"
+DATA/ASM/ASMPARAMETERFILE/REGISTRY.253.822856169

As we can see, the ASM spfile is in a special location and it has ASM file number 253.

Of course, we see the same thing from sqlplus:

$ sqlplus / as sysasm

SQL> show parameter spfile

NAME   TYPE   VALUE
------ ------ -------------------------------------------------
spfile string +DATA/ASM/ASMPARAMETERFILE/registry.253.822856169

SQL>

Let's make a backup of that ASM spfile.

$ asmcmd spbackup +DATA/ASM/ASMPARAMETERFILE/REGISTRY.253.822856169 /tmp/ASMspfile.backup

And check out the contents of the file:

$ strings /tmp/ASMspfile.backup
+ASM.__oracle_base='/u01/app/grid'#ORACLE_BASE set from in memory value
+ASM.asm_diskgroups='RECO','ACFS'#Manual Mount
*.asm_power_limit=1
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'

As we can see, this is a copy of the ASM spfile, which includes the parameters and the associated comments.

ASM spfile discovery

So, how can the ASM instance read the spfile on startup, if the spfile is in a disk group that is not even mounted yet? Not only that - the ASM instance doesn't know which disk group holds the spfile, or even whether an spfile exists at all.

The trick is in the ASM disk headers. To support the ASM spfile in a disk group, two new fields were added to the ASM disk header:
  • kfdhdb.spfile - Allocation unit number of the ASM spfile.
  • kfdhdb.spfflg - ASM spfile flag. If this value is 1, the ASM spfile is on this disk in allocation unit kfdhdb.spfile.

When the ASM instance starts, it reads the disk headers and looks for the spfile information. Once it finds the disks that have the spfile, it can read the actual initialization parameters.

Let's have a look at my disk group DATA. First check the disk group state and redundancy

$ asmcmd lsdg -g DATA | cut -c1-26
Inst_ID  State    Type
     1  MOUNTED  NORMAL

The disk group is mounted and has normal redundancy. This means the ASM spfile will be mirrored, so we should see two disks with the kfdhdb.spfile and kfdhdb.spfflg values set. Let's have a look:

$ for disk in `asmcmd lsdsk -G DATA --suppressheader`
> do
> echo $disk
> kfed read $disk | grep spf
> done
/dev/sdc1
kfdhdb.spfile:                       46 ; 0x0f4: 0x0000002e
kfdhdb.spfflg:                        1 ; 0x0f8: 0x00000001
/dev/sdd1
kfdhdb.spfile:                     2212 ; 0x0f4: 0x000008a4
kfdhdb.spfflg:                        1 ; 0x0f8: 0x00000001
/dev/sde1
kfdhdb.spfile:                        0 ; 0x0f4: 0x00000000
kfdhdb.spfflg:                        0 ; 0x0f8: 0x00000000

As we can see, two disks have the ASM spfile. Let's check the contents of the Allocation Unit 46 on disk /dev/sdc1:

$ dd if=/dev/sdc1 bs=1048576 skip=46 count=1 | strings
+ASM.__oracle_base='/u01/app/grid'#ORACLE_BASE set from in memory value
+ASM.asm_diskgroups='RECO','ACFS'#Manual Mount
*.asm_power_limit=1
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0352732 s, 29.7 MB/s

As we can see, the AU 46 on disk /dev/sdc1 indeed contains the ASM spfile.

ASM spfile alias block

In addition to the new ASM disk header fields, there is a new metadata block type - KFBTYP_ASMSPFALS - that describes the ASM spfile alias. The ASM spfile alias block will be the last block in the ASM spfile.

Let's have a look at the last block of the Allocation Unit 46:

$ kfed read /dev/sdc1 aun=46 blkn=255
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                           27 ; 0x002: KFBTYP_ASMSPFALS
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                     255 ; 0x004: blk=255
kfbh.block.obj:                     253 ; 0x008: file=253
kfbh.check:                   806373865 ; 0x00c: 0x301049e9
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfspbals.incarn:              822856169 ; 0x000: 0x310bc9e9
kfspbals.blksz:                     512 ; 0x004: 0x00000200
kfspbals.size:                        3 ; 0x008: 0x0003
kfspbals.path.len:                    0 ; 0x00a: 0x0000
kfspbals.path.buf:                      ; 0x00c: length=0

There is not much in this metadata block. Most of the entries hold the block header info (fields kfbh.*). The actual ASM spfile alias data (fields kfspbals.*) has only a few entries. The spfile incarnation (822856169) is part of the file name (REGISTRY.253.822856169), the block size is 512 bytes and the file size is 3 blocks. The path info is empty, meaning I don't actually have the ASM spfile alias.
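
As a quick cross-check (a minimal sketch, assuming the same 1 MB AU size and the spfile location on /dev/sdc1 shown above), the block size and file size from the alias block can be used to pull out just the spfile blocks with dd:

$ dd if=/dev/sdc1 bs=512 skip=$((46*2048)) count=3 2>/dev/null | strings

The output should match the parameter listing we got earlier by dumping the whole allocation unit.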

Let's create one. I will first create a pfile from the existing spfile and then create the spfile alias from that pfile.

$ sqlplus / as sysasm

SQL> create pfile='/tmp/pfile+ASM.ora' from spfile;

File created.

SQL> shutdown abort;
ASM instance shutdown

SQL> startup pfile='/tmp/pfile+ASM.ora';
ASM instance started

Total System Global Area 1135747072 bytes
Fixed Size                  2297344 bytes
Variable Size            1108283904 bytes
ASM Cache                  25165824 bytes
ASM diskgroups mounted

SQL> create spfile='+DATA/spfileASM.ora' from pfile='/tmp/pfile+ASM.ora';

File created.

SQL> exit

Looking for the ASM spfile again shows two entries:

$ asmcmd find --type ASMPARAMETERFILE +DATA "*"
+DATA/ASM/ASMPARAMETERFILE/REGISTRY.253.843597139
+DATA/spfileASM.ora

We now see the ASM spfile itself (REGISTRY.253.843597139) and its alias (spfileASM.ora). Having a closer look at spfileASM.ora confirms this is indeed the alias for the registry file:

$ asmcmd ls -l +DATA/spfileASM.ora
Type              Redund  Striped  Time             Sys  Name
ASMPARAMETERFILE  MIRROR  COARSE   MAR 30 20:00:00  N    spfileASM.ora => +DATA/ASM/ASMPARAMETERFILE/REGISTRY.253.843597139

Check the ASM spfile alias block now:

$ kfed read /dev/sdc1 aun=46 blkn=255
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                           27 ; 0x002: KFBTYP_ASMSPFALS
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                     255 ; 0x004: blk=255
kfbh.block.obj:                     253 ; 0x008: file=253
kfbh.check:                  2065104480 ; 0x00c: 0x7b16fe60
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfspbals.incarn:              843597139 ; 0x000: 0x32484553
kfspbals.blksz:                     512 ; 0x004: 0x00000200
kfspbals.size:                        3 ; 0x008: 0x0003
kfspbals.path.len:                   13 ; 0x00a: 0x000d
kfspbals.path.buf:        spfileASM.ora ; 0x00c: length=13

We see that the alias file name now appears in the ASM spfile alias block. Note the new incarnation number - this is a new ASM spfile, created from the pfile.

Conclusion

Starting with ASM version 11.2, the ASM spfile can be stored in an ASM disk group. To support this feature, we now have new ASMCMD commands and, under the covers, we have new ASM metadata structures.

ASM file number 9

The attributes directory - ASM file number 9 - contains metadata about disk group attributes. The attributes directory exists only in disk groups where COMPATIBLE.ASM is 11.1 or higher.

Disk group attributes were introduced in ASM version 11.1[1] and can be used to fine-tune the disk group properties. It is worth noting that some attributes can be set only at the time the disk group is created, while others can be set at any time. Some attributes, for example AU_SIZE, are stored in the disk header, while others, for example COMPATIBLE.ASM, can be stored either in the partnership and status table or in the disk header (depending on the ASM version).

Public attributes

Most attributes are stored in the attributes directory and are externalized via the V$ASM_ATTRIBUTE view. Let's have a look at the disk group attributes for all my disk groups.

SQL> SELECT g.name "Group name", a.name "Attribute", a.value "Value"
FROM v$asm_diskgroup g, v$asm_attribute a
WHERE g.group_number=a.group_number and a.name not like 'template%';

Group name   Attribute                                        Value
------------ ------------------------------------------------ ----------------
ACFS         disk_repair_time                                 3.6h
             au_size                                          1048576
             access_control.umask                             026
             access_control.enabled                           TRUE
             cell.smart_scan_capable                          FALSE
             compatible.advm                                  11.2.0.0.0
             compatible.rdbms                                 11.2
             compatible.asm                                   11.2.0.0.0
             sector_size                                      512
DATA         access_control.enabled                           TRUE
             cell.smart_scan_capable                          FALSE
             compatible.rdbms                                 11.2
             compatible.asm                                   11.2.0.0.0
             sector_size                                      512
             au_size                                          1048576
             disk_repair_time                                 3.6h
             access_control.umask                             026

One attribute I can modify at any time is the disk repair timer. Let's use asmcmd to do that for disk group DATA.

$ asmcmd setattr -G DATA disk_repair_time '8.0h'

$ asmcmd lsattr -lm disk_repair_time
Group_Name  Name              Value  RO  Sys
ACFS        disk_repair_time  3.6h   N   Y
DATA        disk_repair_time  8.0h   N   Y

Hidden attributes

Locate the attributes directory.

SQL> SELECT x.disk_kffxp "Disk#",
 x.xnum_kffxp "Extent",
 x.au_kffxp "AU",
 d.name "Disk name"
FROM x$kffxp x, v$asm_disk_stat d
WHERE x.group_kffxp=d.group_number
 and x.disk_kffxp=d.disk_number
 and d.group_number=2
 and x.number_kffxp=9
ORDER BY 1, 2;

     Disk#     Extent         AU Disk name
---------- ---------- ---------- ------------------------------
         0          0       1146 ASMDISK1
         1          0       1143 ASMDISK2
         2          0       1150 ASMDISK3

Now check out all attributes with kfed.

$ kfed read /dev/oracleasm/disks/ASMDISK3 aun=1150 | more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                           23 ; 0x002: KFBTYP_ATTRDIR
...
kfede[0].entry.incarn:                1 ; 0x024: A=1 NUMM=0x0
kfede[0].entry.hash:                  0 ; 0x028: 0x00000000
kfede[0].entry.refer.number: 4294967295 ; 0x02c: 0xffffffff
kfede[0].entry.refer.incarn:          0 ; 0x030: A=0 NUMM=0x0
kfede[0].name:         disk_repair_time ; 0x034: length=16
kfede[0].value:                    8.0h ; 0x074: length=4
...

Fields kfede[i] will have the disk group attribute names and values. Let's look at all of them:

$ kfed read /dev/oracleasm/disks/ASMDISK3 aun=1150 | egrep "name|value"
kfede[0].name:         disk_repair_time ; 0x034: length=16
kfede[0].value:                    8.0h ; 0x074: length=4
kfede[1].name:       _rebalance_compact ; 0x1a8: length=18
kfede[1].value:                    TRUE ; 0x1e8: length=4
kfede[2].name:            _extent_sizes ; 0x31c: length=13
kfede[2].value:                  1 4 16 ; 0x35c: length=6
kfede[3].name:           _extent_counts ; 0x490: length=14
kfede[3].value:   20000 20000 214748367 ; 0x4d0: length=21
kfede[4].name:                        _ ; 0x604: length=1
kfede[4].value:                       0 ; 0x644: length=1
kfede[5].name:                  au_size ; 0x778: length=7
kfede[5].value:               ; 0x7b8: length=9
kfede[6].name:              sector_size ; 0x8ec: length=11
kfede[6].value:               ; 0x92c: length=9
kfede[7].name:               compatible ; 0xa60: length=10
kfede[7].value:               ; 0xaa0: length=9
kfede[8].name:                     cell ; 0xbd4: length=4
kfede[8].value:                   FALSE ; 0xc14: length=5
kfede[9].name:           access_control ; 0xd48: length=14
kfede[9].value:                   FALSE ; 0xd88: length=5

This gives us a glimpse into the hidden (underscore) disk group attributes. We can see that _rebalance_compact=TRUE - that is the attribute that controls the compacting phase of the disk group rebalance. We also see how the extent size will grow (_extent_sizes) - the initial size will be 1 AU, then 4 AU and finally 16 AU. And _extent_counts shows the break points for the extent size growth - the first 20000 extents will be 1 AU in size, the next 20000 will be 4 AU and the rest will be 16 AU.
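
To put those break points in perspective (a rough sketch, assuming a 1 MB AU and ignoring mirror copies), they translate into file sizes like this:

SQL> select 20000*1 "MB at 1 AU", 20000*4 "MB at 4 AU", 20000*1+20000*4 "16 AU from (MB)" from dual;

MB at 1 AU MB at 4 AU 16 AU from (MB)
---------- ---------- ---------------
     20000      80000          100000

In other words, a file smaller than about 20 GB uses only 1 AU extents, and 16 AU extents come into play once a file grows beyond roughly 100 GB.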

Conclusion

Disk group attributes can be used to fine-tune the disk group properties. Most attributes are stored in the attributes directory and are externalized via the V$ASM_ATTRIBUTE view.

[1] In ASM versions prior to 11.1 it was possible to create a disk group with a user-specified allocation unit size. That was done via the hidden ASM initialization parameter _ASM_AUSIZE. While technically that was not a disk group attribute, it served the same purpose as the AU_SIZE attribute in ASM version 11.1 and later.

Disk Group Attributes

Disk group attributes were introduced in ASM version 11.1. They are bound to a disk group, rather than the ASM instance. Some attributes can be set only at the time the disk group is created, some only after the disk group is created, and some attributes can be set at any time, by altering the disk group.

This is the follow up on the attributes directory post.

ACCESS_CONTROL.ENABLED

This attribute determines whether ASM File Access Control is enabled for a disk group. The value can be TRUE or FALSE (default).

If the attribute is set to TRUE, accessing ASM files is subject to access control. If FALSE, any user can access every file in the disk group. All other operations behave independently of this attribute.

This attribute can only be set when altering a disk group.

ACCESS_CONTROL.UMASK

This attribute determines which permissions are masked out on the creation of an ASM file for the owner, group, and others not in the user group. This attribute applies to all files in a disk group.

The values can be combinations of three digits {0|2|6} {0|2|6} {0|2|6}. The default is 066.

Setting to '0' does not mask anything. Setting to '2' masks out write permission. Setting to '6' masks out both read and write permissions.

Before setting the ACCESS_CONTROL.UMASK disk group attribute, the ACCESS_CONTROL.ENABLED has to be set to TRUE.

This attribute can only be set when altering a disk group.
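
For illustration (a minimal sketch, assuming a disk group named DATA), the two access control attributes would be set in this order:

SQL> alter diskgroup DATA set attribute 'access_control.enabled' = 'TRUE';

Diskgroup altered.

SQL> alter diskgroup DATA set attribute 'access_control.umask' = '026';

Diskgroup altered.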

AU_SIZE

The AU_SIZE attribute controls the allocation unit size and can only be set when creating the disk group.

It is worth spelling out that each disk group can have a different allocation unit size.
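
For example (a sketch with a hypothetical disk group name and disk path), the allocation unit size is specified at creation time like this:

SQL> create diskgroup DATA2 external redundancy
     disk '/dev/sdf1'
     attribute 'au_size' = '4M';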

CELL.SMART_SCAN_CAPABLE [Exadata]

This attribute is applicable to Exadata when the disk group is created from the grid disks on the storage cells. It enables the smart scan functionality for the objects stored in that disk group.

COMPATIBLE.ASM

The value for the disk group COMPATIBLE.ASM attribute determines the minimum software version for an ASM instance that can use the disk group. This setting also affects the format of the ASM metadata structures.

The default value for COMPATIBLE.ASM is 10.1 when using the CREATE DISKGROUP statement, the ASMCMD mkdg command or the Enterprise Manager Create Disk Group page.

When creating a disk group with the ASMCA, the default value is 11.2 in ASM version 11gR2 and 12.1 in ASM version 12c.
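
A disk group created with a lower setting can have the attribute advanced later (a sketch, assuming a disk group named DATA; note that compatibility can be advanced, but not reversed):

SQL> alter diskgroup DATA set attribute 'compatible.asm' = '11.2.0.3.0';

Diskgroup altered.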

COMPATIBLE.RDBMS

The value for the COMPATIBLE.RDBMS attribute determines the minimum COMPATIBLE database initialization parameter setting for any database instance that is allowed to use the disk group.

Before advancing the COMPATIBLE.RDBMS attribute, ensure that the values for the COMPATIBLE initialization parameter for all databases that access the disk group are set to at least the value of the new setting for the COMPATIBLE.RDBMS.

COMPATIBLE.ADVM

The value for the COMPATIBLE.ADVM attribute determines whether the disk group can contain the ASM volumes. The value must be set to 11.2 or higher. Before setting this attribute, the COMPATIBLE.ASM value must be 11.2 or higher. Also, the ADVM volume drivers must be loaded in the supported environment.

By default, the value of the COMPATIBLE.ADVM attribute is empty until set.

CONTENT.CHECK [12c]

The CONTENT.CHECK attribute enables or disables content checking during the disk group rebalance. The attribute value can be TRUE or FALSE.

The content checking can include Hardware Assisted Resilient Data (HARD) checks on user data, validation of file types from the file directory against the block contents and the file directory information, and mirror side comparison.

When the attribute is set to TRUE, the logical content checking is enabled for all rebalance operations.

The content checking is also known as the disk scrubbing feature.

CONTENT.TYPE [11.2.0.3, Exadata]

The CONTENT.TYPE attribute identifies the disk group type that can be DATA, RECOVERY or SYSTEM. It determines the distance to the nearest partner disk/failgroup. The default value is DATA which specifies a distance of 1, the value of RECOVERY specifies a distance of 3 and the value of SYSTEM specifies a distance of 5.

A distance of 1 simply means that ASM considers all disks for partnership. A distance of 3 means that every 3rd disk will be considered for partnership, and a distance of 5 means that every 5th disk will be considered for partnership.

The attribute can be specified when creating or altering a disk group. If the CONTENT.TYPE attribute is set or changed using the ALTER DISKGROUP, the new configuration does not take effect until a disk group rebalance is explicitly run.

The CONTENT.TYPE attribute is only valid for NORMAL and HIGH redundancy disk groups. The COMPATIBLE.ASM attribute must be set to 11.2.0.3 or higher to enable the CONTENT.TYPE attribute.
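
For example (a sketch, assuming a normal redundancy disk group named RECO), the attribute would be changed and the rebalance run explicitly:

SQL> alter diskgroup RECO set attribute 'content.type' = 'recovery';

Diskgroup altered.

SQL> alter diskgroup RECO rebalance power 4;

Diskgroup altered.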

DISK_REPAIR_TIME

The value of the DISK_REPAIR_TIME attribute determines the amount of time ASM will keep the disk offline, before dropping it. This is relevant to the fast mirror resync feature for which the COMPATIBLE.ASM attribute must be set to 11.1 or higher.

This attribute can only be set when altering a disk group.

FAILGROUP_REPAIR_TIME [12c]

The FAILGROUP_REPAIR_TIME attribute specifies a default repair time for the failure groups in the disk group. The failure group repair time is used if ASM determines that an entire failure group has failed. The default value is 24 hours. If there is a repair time specified for a disk, such as with the DROP AFTER clause of the ALTER DISKGROUP OFFLINE DISK statement, that disk repair time overrides the failure group repair time.

This attribute can only be set when altering a disk group and is only applicable to NORMAL and HIGH redundancy disk groups.
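
For example (a sketch, assuming a normal redundancy disk group named DATA):

SQL> alter diskgroup DATA set attribute 'failgroup_repair_time' = '48.0h';

Diskgroup altered.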

IDP.BOUNDARY and IDP.TYPE [Exadata]

These attributes are used to configure Exadata storage, and are relevant for the Intelligent Data Placement feature.

PHYS_META_REPLICATED [12c]

The PHYS_META_REPLICATED attribute tracks the replication status of a disk group. When the ASM compatibility of a disk group is advanced to 12.1 or higher, the physical metadata of each disk is replicated. This metadata includes the disk header, free space table blocks and allocation table blocks. The replication is performed online asynchronously. This attribute value is set to true by ASM if the physical metadata of every disk in the disk group has been replicated.

This attribute is only defined in a disk group with the COMPATIBLE.ASM set to 12.1 and higher. The attribute is read-only and is intended for information only - a user cannot set or change its value. The values are either TRUE or FALSE.

SECTOR_SIZE

The SECTOR_SIZE attribute specifies the sector size for disks in a disk group and can only be set when creating a disk group.

The values for the SECTOR_SIZE can be 512, 4096 or 4K (provided the disks support those values). The default value is platform dependent. The COMPATIBLE.ASM and COMPATIBLE.RDBMS attributes must be set to 11.2 or higher to set the sector size to a value other than the default value.

NOTE: ASM Cluster File System (ACFS) does not support 4 KB sector drives.
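
For illustration (a sketch with hypothetical names; the disks must actually support the 4 KB sector size), a 4 KB sector disk group would be created like this:

SQL> create diskgroup DATA4K external redundancy
     disk '/dev/mapper/disk4k1'
     attribute 'sector_size' = '4096',
               'compatible.asm' = '11.2',
               'compatible.rdbms' = '11.2';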

STORAGE.TYPE

The STORAGE.TYPE attribute specifies the type of the disks in the disk group. The possible values are EXADATA, PILLAR, ZFSAS and OTHER. If the attribute is set to EXADATA|PILLAR|ZFSAS then all disks in the disk group must be of that type. If the attribute is set to OTHER, any types of disks can be in the disk group.

If the STORAGE.TYPE disk group attribute is set to PILLAR or ZFSAS, the Hybrid Columnar Compression (HCC) functionality can be enabled for the objects in the disk group. Exadata already supports HCC.

NOTE: The ZFS storage must be provisioned through Direct NFS (dNFS) and the Pillar Axiom storage must be provisioned via the SCSI or the Fiber Channel interface.

To set the STORAGE.TYPE attribute, the COMPATIBLE.ASM and COMPATIBLE.RDBMS disk group attributes must be set to 11.2.0.3 or higher. For maximum support with ZFS storage, set COMPATIBLE.ASM and COMPATIBLE.RDBMS disk group attributes to 11.2.0.4 or higher.

The STORAGE.TYPE attribute can be set when creating or altering a disk group. The attribute cannot be set when clients are connected to the disk group. For example, the attribute cannot be set when an ADVM volume is enabled on the disk group.

The attribute is not visible in the V$ASM_ATTRIBUTE view or with the ASMCMD lsattr command until the attribute has been set.

THIN_PROVISIONED [12c]

The THIN_PROVISIONED attribute enables or disables the functionality to discard unused storage space after a disk group rebalance is completed. The attribute value can be TRUE or FALSE (default).

Storage vendor products that support thin provisioning have the capability to reuse the discarded storage space for a more efficient overall physical storage utilization.
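
For example (a sketch, assuming a disk group named DATA on storage that supports thin provisioning):

SQL> alter diskgroup DATA set attribute 'thin_provisioned' = 'TRUE';

Diskgroup altered.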

APPLIANCE.MODE [11.2.0.4, Exadata]

The APPLIANCE.MODE attribute improves the disk rebalance completion time when dropping one or more ASM disks. This means that redundancy is restored faster after a (disk) failure. The attribute is automatically enabled when creating a new disk group in Exadata. Existing disk groups must explicitly set the attribute using the ALTER DISKGROUP command. This feature is also known as fixed partnering.

The attribute can only be enabled on disk groups that meet the following requirements:

  • The Oracle ASM disk group attribute COMPATIBLE.ASM is set to release 11.2.0.4, or later.
  • The CELL.SMART_SCAN_CAPABLE attribute is set to TRUE.
  • All disks in the disk group are the same type of disk, such as all hard disks or all flash disks.
  • All disks in the disk group are the same size.
  • All failure groups in the disk group have an equal number of disks.
  • No disk in the disk group is offline.

Minimum software: Oracle Exadata Storage Server Software release 11.2.3.3 running Oracle Database 11g Release 2 (11.2) release 11.2.0.4

NOTE: This feature is not available on Oracle Database version 12.1.0.1.
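
If the requirements are met, the attribute can be enabled on an existing disk group like this (a sketch, assuming an Exadata disk group named DATA):

SQL> alter diskgroup DATA set attribute 'appliance.mode' = 'TRUE';

Diskgroup altered.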

Hidden disk group attributes

Not all disk group attributes are documented. Here are some of the more interesting ones.

_REBALANCE_COMPACT

The _REBALANCE_COMPACT attribute is related to the compacting phase of the rebalance. The attribute value can be TRUE (default) or FALSE. Setting the attribute to FALSE, disables the compacting phase of the disk group rebalance.
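
Should that ever be needed - and only if advised by Oracle Support - the compacting phase could be disabled like this (a sketch, assuming a disk group named DATA):

SQL> alter diskgroup DATA set attribute '_rebalance_compact' = 'FALSE';

Diskgroup altered.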

_EXTENT_COUNTS

The _EXTENT_COUNTS attribute is related to the variable extent size feature; it determines the points at which the extent size is incremented.

The value of the attribute is "20000 20000 214748367", which means that the first 20000 extents will be 1 AU in size, the next 20000 extents will have the size determined by the second value of the _EXTENT_SIZES attribute, and the rest will have the size determined by the third value of the _EXTENT_SIZES attribute.

_EXTENT_SIZES

_EXTENT_SIZES is the second attribute relevant to the variable extent size feature; it determines the extent size increments, in number of AUs.

In ASM version 11.1 the attribute value is "1 8 64". In ASM version 11.2 and later, the value of _EXTENT_SIZES is "1 4 16".

V$ASM_ATTRIBUTE view and ASMCMD lsattr command

The disk group attributes can be queried via the V$ASM_ATTRIBUTE view and the ASMCMD lsattr command.

This is one way to list the attributes for disk group PLAY:

$ asmcmd lsattr -G PLAY -l

Name                     Value
access_control.enabled   FALSE
access_control.umask     066
au_size                  4194304
cell.smart_scan_capable  FALSE
compatible.asm           11.2.0.0.0
compatible.rdbms         11.2.0.0.0
disk_repair_time         3.6h
sector_size              512
$

The disk group attributes can be modified via SQL ALTER DISKGROUP SET ATTRIBUTE, ASMCMD setattr command and the ASMCA. This is an example of using the ASMCMD setattr command to modify the DISK_REPAIR_TIME attribute for disk group PLAY:

$ asmcmd setattr -G PLAY disk_repair_time '4.5 H'

Check the new value:

$ asmcmd lsattr -G PLAY -l disk_repair_time
Name              Value
disk_repair_time  4.5 H
$

Conclusion

Disk group attributes, introduced in ASM version 11.1, are a great way to fine-tune the disk group capabilities. Some of the attributes are Exadata specific (as marked) and some are available in certain versions only (as marked). Most disk group attributes are documented and accessible via the V$ASM_ATTRIBUTE view. Some of the undocumented attributes were also discussed; those should not be modified unless advised by Oracle Support.

How to reconfigure Oracle Restart

The other day I had to reconfigure an Oracle Restart 12c environment, and I couldn't find my blog post on that. It turns out that I never published it here, as my MOS document on this topic was created back in 2010 when this blog didn't exist!

The original document was written for Oracle Restart 11gR2, but it is still valid for 12c. Here it is.

Introduction

This document is about reconfiguring Oracle Restart. One reason for such an action might be a server rename. If the server was renamed and then rebooted, the ASM instance startup would fail with ORA-29701: unable to connect to Cluster Synchronization Service.

The solution is to reconfigure Oracle Restart as follows.

1. Remove Oracle Restart configuration

This step should be performed as the privileged user (root).

# $GRID_HOME/crs/install/roothas.pl -deconfig -force

The expected result is "Successfully deconfigured Oracle Restart stack".

2. Reconfigure Oracle Restart

This step should also be performed as the privileged user (root).

# $GRID_HOME/crs/install/roothas.pl

The expected result is "Successfully configured Oracle Grid Infrastructure for a Standalone Server".

3. Add ASM back to Oracle Restart configuration

This step should be performed as the Grid Infrastructure owner (grid user).

$ srvctl add asm

The expected result is no output, just a return to the operating system prompt.

4. Start ASM instance

This step should be performed as the Grid Infrastructure owner (grid user).

$ srvctl start asm

That should start the ASM instance.

Note that at this time there will be no ASM initialization or server parameter file.

5. Recreate ASM server parameter file (SPFILE)

This step should be performed as the Grid Infrastructure owner (grid user).

Create a temporary initialization parameter file (e.g. /tmp/init+ASM.ora) with content similar to this (of course, with your own disk group names):

asm_diskgroups='DATA','RECO'
large_pool_size=12M
remote_login_passwordfile='EXCLUSIVE'

Mount the disk group where the new server parameter file (SPFILE) will reside (e.g. DATA) and create the SPFILE:

$ sqlplus / as sysasm

SQL> alter diskgroup DATA mount;

Diskgroup altered.

SQL> create spfile='+DATA' from pfile='/tmp/init+ASM.ora';

File created.

SQL> show parameter spfile

NAME   TYPE   VALUE
------ ------ -------------------------------------------------
spfile string +DATA/asm/asmparameterfile/registry.253.707737977

6. Restart HAS stack

This step should be performed as the Grid Infrastructure owner (grid user).

$ crsctl stop has

$ crsctl start has

7. Add components back to Oracle Restart configuration

Add the database, the listener and other components, back into the Oracle Restart configuration.

7.1. Add database

This step should be performed as RDBMS owner (oracle user).

In 11gR2 the command is:

$ srvctl add database -d db_unique_name -o oracle_home

In 12c the command is:

$ srvctl add database -db db_unique_name -oraclehome oracle_home
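
For example (hypothetical database unique name and Oracle home - adjust to your environment):

$ srvctl add database -db orcl -oraclehome /u01/app/oracle/product/12.1.0/dbhome_1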

7.2. Add listener

This step should be performed as the Grid Infrastructure owner (grid user).

$ srvctl add listener

7.3. Add other components

For information on how to add back additional components in 11gR2 see:
Oracle® Database Administrator's Guide 11g Release 2 (11.2)
Chapter 4 Configuring Automatic Restart of an Oracle Database

For 12c see:
Oracle® Database Administrator's Guide 12c Release 1 (12.1)
Chapter 4 Configuring Automatic Restart of an Oracle Database

Conclusion

As noted earlier, I have published a MOS document on this: How to Reconfigure Oracle Restart - MOS Doc ID 986740.1.

How to resize grid disks in Exadata

This document explains how to resize the grid disks in Exadata (to make them larger) when there is free space in the cell disks. The free space can be anywhere on the cell disks. In other words, the grid disks can be built from, and extended with, non-contiguous free space.

Typically, there is no free space in Exadata cell disks, in which case MOS Doc ID 1465230.1 needs to be followed. But if there is free space in the cell disks, the procedure is much simpler and it can be accomplished with a single ASM rebalance operation.

This document shows an example of performing this task on a quarter rack system (two database servers and three storage cells). On an Exadata system with more storage cells, the only additional steps would be to resize the grid disks on the additional storage cells.

Storage cells in the example are exacell01, exacell02 and exacell03, the disk group is DATA and the new grid disk size is 100000 MB.

Resize grid disks on storage cells

Log in as root to storage cell 1, and run the following command:

# cellcli -e alter griddisk  DATA_CD_00_exacell01, DATA_CD_01_exacell01, DATA_CD_02_exacell01, DATA_CD_03_exacell01, DATA_CD_04_exacell01, DATA_CD_05_exacell01, DATA_CD_06_exacell01, DATA_CD_07_exacell01, DATA_CD_08_exacell01, DATA_CD_09_exacell01, DATA_CD_10_exacell01, DATA_CD_11_exacell01 size=100000M;

Log in as root to storage cell 2, and run the following command:

# cellcli -e alter griddisk  DATA_CD_00_exacell02, DATA_CD_01_exacell02, DATA_CD_02_exacell02, DATA_CD_03_exacell02, DATA_CD_04_exacell02, DATA_CD_05_exacell02, DATA_CD_06_exacell02, DATA_CD_07_exacell02, DATA_CD_08_exacell02, DATA_CD_09_exacell02, DATA_CD_10_exacell02, DATA_CD_11_exacell02 size=100000M;

Log in as root to storage cell 3, and run the following command:

# cellcli -e alter griddisk  DATA_CD_00_exacell03, DATA_CD_01_exacell03, DATA_CD_02_exacell03, DATA_CD_03_exacell03, DATA_CD_04_exacell03, DATA_CD_05_exacell03, DATA_CD_06_exacell03, DATA_CD_07_exacell03, DATA_CD_08_exacell03, DATA_CD_09_exacell03, DATA_CD_10_exacell03, DATA_CD_11_exacell03 size=100000M;

As noted earlier, if you have a larger system, e.g. an Exadata half rack with 7 storage cells, resize the grid disks for disk group DATA on all remaining storage cells.

Resize ASM disks

Log in as the Grid Infrastructure owner to database server 1, and log in to ASM instance 1 as sysasm.

$ sqlplus / as sysasm

Resize all disks in disk group DATA, with the following command:

SQL> ALTER DISKGROUP DATA RESIZE ALL;

Note that there was no need to specify the new disk size, as ASM will get that from the grid disks. If you would like to speed up the rebalance, add REBALANCE POWER 32 to the above command.
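
That would look like this (a sketch of the same command with the optional power clause):

SQL> ALTER DISKGROUP DATA RESIZE ALL REBALANCE POWER 32;

Diskgroup altered.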

The command will trigger the rebalance operation for disk group DATA.

Monitor the rebalance with the following command:

SQL> select * from gv$asm_operation;

Once the query returns "no rows selected", the rebalance has completed and all disks in disk group DATA should show the new size.

Check that the ASM sees the new size, with the following command:

SQL> select name, total_mb from v$asm_disk_stat where name like 'DATA%';

The TOTAL_MB column should show 100000 for all disks in disk group DATA.

Conclusion

If there is free space in Exadata cell disks, the disk group resize can be accomplished in two steps - grid disk resize on all storage cells followed by the disk resize in ASM. This requires a single ASM rebalance operation.

I have published this on MOS as Doc ID 1684112.1.
