ASM Support Guy

REQUIRED_MIRROR_FREE_MB

The REQUIRED_MIRROR_FREE_MB and the USABLE_FILE_MB are two very interesting columns in the V$ASM_DISKGROUP[_STAT] view. Oracle Support gets many questions about the meaning of those columns and how the values are calculated. I wanted to write about this, but I realised that I could not do it better than Harald van Breederode, so I asked him for permission to simply reference his write up. He agreed, so please have a look at his excellent post Demystifying ASM REQUIRED_MIRROR_FREE_MB and USABLE_FILE_MB.

How much space can I use

Now that the REQUIRED_MIRROR_FREE_MB and the USABLE_FILE_MB have been explained, I would like to add that the ASM does not prevent you from using all available space - effectively half of the total space for a normal redundancy disk group and one third of the total space for a high redundancy disk group. But if you do fill up your disk group to the brim, there will be no room to grow or add any files, and in the case of a disk failure, there will be no room to restore the redundancy for some data - until the failed disk is replaced and the rebalance completes.
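
If you just want a quick look at how full a disk group really is, a query along these lines (a minimal sketch against the same V$ASM_DISKGROUP_STAT view used throughout this post) shows the raw space used and what USABLE_FILE_MB says is safely left:

SQL> select NAME, TOTAL_MB, FREE_MB,
  round(100*(TOTAL_MB-FREE_MB)/TOTAL_MB, 1) "Raw used %",
  USABLE_FILE_MB
from v$asm_diskgroup_stat;

A negative USABLE_FILE_MB is the warning sign - it means there is no longer enough free space to restore full redundancy after losing REQUIRED_MIRROR_FREE_MB worth of storage.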

Exadata with ASM version 11gR2

In Exadata with ASM version 11.2 the REQUIRED_MIRROR_FREE_MB is reported as the size of the largest failgroup [1] in the disk group. To demonstrate, let's look at an Exadata system with ASM version 11.2.0.4.

As in most Exadata installations, I have three disk groups.

[grid@exadb01 ~]$ sqlplus / as sysasm

SQL*Plus: Release 11.2.0.4.0 Production on [date]

SQL> select NAME, GROUP_NUMBER from v$asm_diskgroup_stat;

NAME      GROUP_NUMBER
--------- ------------
DATA                 1
DBFS_DG              2
RECO                 3

SQL>

For the purpose of this example, we will look at the disk group DBFS_DG. Normally there would be 10 disks per failgroup for disk group DBFS_DG. I have dropped a few disks to demonstrate that the REQUIRED_MIRROR_FREE_MB is reported as the size of the largest failgroup.

SQL> select FAILGROUP, count(NAME) "Disks", sum(TOTAL_MB) "MB"
from v$asm_disk_stat
where GROUP_NUMBER=2
group by FAILGROUP
order by 3;

FAILGROUP       Disks         MB
---------- ---------- ----------
EXACELL04           7     180096
EXACELL01           8     205824
EXACELL02           9     231552
EXACELL03          10     257280

SQL>

Note that the total space in the largest failgroup is 257280 MB.

Finally, we see that the REQUIRED_MIRROR_FREE_MB is reported as the size of the largest failgroup:

SQL> select NAME, TOTAL_MB, FREE_MB, REQUIRED_MIRROR_FREE_MB, USABLE_FILE_MB
from v$asm_diskgroup_stat
where GROUP_NUMBER=2;

NAME         TOTAL_MB    FREE_MB REQUIRED_MIRROR_FREE_MB USABLE_FILE_MB
---------- ---------- ---------- ----------------------- --------------
DBFS_DG        874752     801420                  257280         272070

SQL>

The ASM calculates the USABLE_FILE_MB using the following formula:

USABLE_FILE_MB = (FREE_MB - REQUIRED_MIRROR_FREE_MB) / 2

Which gives (801420 - 257280) / 2 = 272070 MB.
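
The same arithmetic can be cross-checked straight from the view; this is just the formula above expressed in SQL:

SQL> select NAME, FREE_MB, REQUIRED_MIRROR_FREE_MB,
  (FREE_MB - REQUIRED_MIRROR_FREE_MB)/2 "Calculated",
  USABLE_FILE_MB
from v$asm_diskgroup_stat
where GROUP_NUMBER=2;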

[1] In Exadata, all failgroups are typically of the same size

Exadata with ASM version 12cR1

In Exadata with ASM version 12cR1, the REQUIRED_MIRROR_FREE_MB is reported as the size of the largest disk [2] in the disk group.

Here is an example from an Exadata system with ASM version 12.1.0.2.0.

[grid@exadb03 ~]$ sqlplus / as sysasm

SQL*Plus: Release 12.1.0.2.0 Production on [date]

SQL> select NAME, GROUP_NUMBER from v$asm_diskgroup_stat;

NAME      GROUP_NUMBER
--------- ------------
DATA                 1
DBFS_DG              2
RECO                 3

SQL>

Again, I have failgroups of different sizes in the disk group DBFS_DG:

SQL> select FAILGROUP, count(NAME) "Disks", sum(TOTAL_MB) "MB"
from v$asm_disk_stat
where GROUP_NUMBER=2
group by FAILGROUP
order by 3;

FAILGROUP       Disks         MB
---------- ---------- ----------
EXACELL05           8     238592
EXACELL07           9     268416
EXACELL06          10     298240

SQL>

The total space in the largest failgroup is 298240 MB, but this time the REQUIRED_MIRROR_FREE_MB is reported as 29824 MB:

SQL> select NAME, TOTAL_MB, FREE_MB, REQUIRED_MIRROR_FREE_MB, USABLE_FILE_MB
from v$asm_diskgroup_stat
where GROUP_NUMBER=2;

NAME         TOTAL_MB    FREE_MB REQUIRED_MIRROR_FREE_MB USABLE_FILE_MB
---------- ---------- ---------- ----------------------- --------------
DBFS_DG        805248     781764                   29824         375970

SQL>

As we can see, that is the size of the largest disk in the disk group:

SQL> select max(TOTAL_MB) from v$asm_disk_stat where GROUP_NUMBER=2;

MAX(TOTAL_MB)
-------------
        29824

SQL>

The USABLE_FILE_MB was calculated using the same formula:

USABLE_FILE_MB = (FREE_MB - REQUIRED_MIRROR_FREE_MB) / 2

Which gives (781764 - 29824) / 2 = 375970 MB.

[2] In Exadata, all disks are typically of the same size

Conclusion

The REQUIRED_MIRROR_FREE_MB and the USABLE_FILE_MB are intended to assist DBAs and storage administrators with planning the disk group capacity and redundancy. The values are reported, but not enforced, by the ASM.

In Exadata with ASM version 12cR1, the REQUIRED_MIRROR_FREE_MB is reported as the size of the largest disk in the disk group. This is by design, to reflect the experience from the field, which shows that it is individual disks that fail, not whole storage cells.


Find block in ASM

In the post Where is my data I have shown how to locate and extract an Oracle datafile block from ASM. To make things easier, I have now created a Perl script find_block.pl that automates the process - you provide the datafile name and the block number, and the script generates the command to extract the data block from ASM.

find_block.pl

The find_block.pl is a Perl script that constructs the dd or the kfed command to extract a block from ASM. It should work with all Linux and Unix ASM versions and with local (non-flex) ASM in the standalone (single instance) or cluster environments.

The script should be run as the ASM/Grid Infrastructure owner, using the perl binary in the ASM oracle home. In a cluster environment, the script can be run from any node. Before running the script, set the ASM environment and make sure the ORACLE_SID, ORACLE_HOME, LD_LIBRARY_PATH, etc are set correctly. For ASM versions 10g and 11gR1, also set the environment variable PERL5LIB, like this:

export PERL5LIB=$ORACLE_HOME/perl/lib/5.8.3:$ORACLE_HOME/perl/lib/site_perl

Run the script as follows:

$ORACLE_HOME/perl/bin/perl find_block.pl filename block

Where:
  • filename is the name of the file from which to extract the block. For a datafile, the file name can be obtained from the database instance with SELECT NAME FROM V$DATAFILE.
  • block is the block number to be extracted from ASM.

The output should look like this:

Use the following command(s) to extract block N into the binary file block_N.dd:
dd if=[ASM disk path] ... of=block_N.dd

Or in Exadata:

Use the following command(s) to extract block N into the text file block_N.txt:
kfed read dev=[ASM disk path] ... > block_N.txt

If the file redundancy is external, the script generates a single command. For a normal redundancy file, the script generates two commands, and for a high redundancy file it generates three commands.
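
If you want to confirm the file redundancy yourself, rather than inferring it from the number of generated commands, you can query V$ASM_FILE in the ASM instance. A minimal sketch, using the file number embedded in the ASM file name (268 for the datafile b1.268.783714605 in the first example below):

SQL> select GROUP_NUMBER, FILE_NUMBER, TYPE, REDUNDANCY
from V$ASM_FILE
where FILE_NUMBER=268;

The REDUNDANCY column shows UNPROT, MIRROR or HIGH.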

Example with ASM version 10.2.0.1

The first example is with a single instance ASM version 10.2.0.1. I first create the table and insert some data, in the database instance, of course.

[oracle@cat10g ~]$ sqlplus / as sysdba

SQL*Plus: Release 10.2.0.1.0 - Production on [date]

SQL> create table TAB1 (n number, name varchar2(16)) tablespace B1;

Table created.

SQL> insert into TAB1 values (1, 'ONE');

1 row created.

SQL> insert into TAB1 values (1, 'TWO');

1 row created.

SQL> insert into TAB1 values (1, 'THREE');

1 row created.

SQL> commit;

Commit complete.

SQL>

Now get the filename and the block number.

SQL> select t.name "Tablespace", f.name "Datafile"
from v$tablespace t, v$datafile f
where t.ts#=f.ts# and t.name='B1';

Tablespace Datafile
---------- -----------------------------------
B1         +DATA/cat/datafile/b1.268.783714605

SQL> select ROWID, NAME from TAB1;

ROWID              NAME
------------------ -----
AAANE0AAGAAAAylAAA ONE
AAANE0AAGAAAAylAAB TWO
AAANE0AAGAAAAylAAC THREE

SQL> select DBMS_ROWID.ROWID_BLOCK_NUMBER('AAANE0AAGAAAAylAAA') "Block number" from dual;

Block number
------------
      3237

SQL>

Switch to the ASM environment, set PERL5LIB, and run the script.

$ export PERL5LIB=$ORACLE_HOME/perl/lib/5.8.3:$ORACLE_HOME/perl/lib/site_perl
$ $ORACLE_HOME/perl/bin/perl find_block.pl +DATA/cat/datafile/b1.268.783714605 3237
Use the following command(s) to extract block 3237 into the binary file block_3237.dd:
dd if=/dev/oracleasm/disks/ASMDISK01 bs=8192 count=1 skip=150437 of=block_3237.dd
$

From the output of the find_block.pl, I see that the specified file is external redundancy, as the script produced a single dd command. Run the command to generate the block_3237.dd file:

$ dd if=/dev/oracleasm/disks/ASMDISK01 bs=8192 count=1 skip=150437 of=block_3237.dd
$

Looking at the content of the block_3237.dd file, with the od utility, I see the data inserted in the table:

$ od -c block_3237.dd
...
0017740 301 002 005 T H R E E , 001 002 002 301 002 003 T
0017760 W O , 001 002 002 301 002 003 O N E 001 006 y 305
0020000
$

Example with ASM version 12.1.0.1 in Exadata

In Exadata we cannot use the dd command to extract the block, as the ASM disks are not visible from the database server. To get the database block, we can use the kfed tool, so the find_block.pl will construct a kfed command that can be used to extract the block from ASM.

Let's have a look at an example with ASM version 12.1.0.1, in a two node cluster, with the datafile in a pluggable database in Exadata.

As in the previous example, I first create the table and insert some data.

$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.1.0 Production on [date]

SQL> alter pluggable database BR_PDB open;

Pluggable database altered.

SQL> show pdbs

CON_ID CON_NAME OPEN MODE   RESTRICTED
------ -------- ----------- ----------
       2 PDB$SEED READ ONLY   NO
...
       5 BR_PDB   READ WRITE  NO

SQL>

$ sqlplus bane/welcome1@BR_PDB

SQL*Plus: Release 12.1.0.1.0 Production on [date]

SQL> create table TAB1 (n number, name varchar2(16)) tablespace USERS;

Table created.

SQL> insert into TAB1 values (1, 'CAT');

1 row created.

SQL> insert into TAB1 values (2, 'DOG');

1 row created.

SQL> commit;

Commit complete.

SQL> select t.name "Tablespace", f.name "Datafile"
from v$tablespace t, v$datafile f
where t.ts#=f.ts# and t.name='USERS';

Tablespace Datafile
---------- ---------------------------------------------
USERS      +DATA/CDB/054.../DATAFILE/users.588.860861901

SQL> select ROWID, NAME from TAB1;

ROWID              NAME
------------------ ----
AAAWYEABfAAAACDAAA CAT
AAAWYEABfAAAACDAAB DOG

SQL> select DBMS_ROWID.ROWID_BLOCK_NUMBER('AAAWYEABfAAAACDAAA') "Block number" from dual;

Block number
------------
       131

SQL>

Switch to the ASM environment, and run the script.

$ $ORACLE_HOME/perl/bin/perl find_block.pl +DATA/CDB/0548068A10AB14DEE053E273BB0A46D1/DATAFILE/users.588.860861901 131
Use the following command(s) to extract block 131 into the text file block_131.txt:
kfed read dev=o/192.168.1.9/DATA_CD_03_exacelmel05 ausz=4194304 aunum=16212 blksz=8192 blknum=131 | grep -iv ^kf > block_131.txt
kfed read dev=o/192.168.1.11/DATA_CD_09_exacelmel07 ausz=4194304 aunum=16267 blksz=8192 blknum=131 | grep -iv ^kf > block_131.txt

Note that the find_block.pl generated two commands, as that datafile is normal redundancy. Run one of the commands:

$ kfed read dev=o/192.168.1.9/DATA_CD_03_exacelmel05 ausz=4194304 aunum=16212 blksz=8192 blknum=131 | grep -iv ^kf > block_131.txt
$

Review the content of the block_131.txt file (note that this is a text file). Sure enough I see my DOG and my CAT:

$ more block_131.txt
...
FD5106080 00000000 00000000 ...  [................]
      Repeat 501 times
FD5107FE0 00000000 00000000 ...  [........,......D]
FD5107FF0 012C474F 02C10202 ...  [OG,......CAT..,-]
$

Find any block

The find_block.pl can be used to extract a block from any file stored in ASM. Just for fun, I ran the script on a controlfile and a random block:

$ $ORACLE_HOME/perl/bin/perl find_block.pl +DATA/CDB/CONTROLFILE/current.289.843047837 5
Use the following command(s) to extract block 5 into the text file block_5.txt:
kfed read dev=o/192.168.1.9/DATA_CD_10_exacelmel05 ausz=4194304 aunum=73 blksz=16384 blknum=5 | grep -iv ^kf > block_5.txt
kfed read dev=o/192.168.1.11/DATA_CD_01_exacelmel07 ausz=4194304 aunum=66 blksz=16384 blknum=5 | grep -iv ^kf > block_5.txt
kfed read dev=o/192.168.1.10/DATA_CD_04_exacelmel06 ausz=4194304 aunum=78 blksz=16384 blknum=5 | grep -iv ^kf > block_5.txt
$

A keen observer will notice that the script worked out the correct block size for the controlfile (16k) and that it generated three different commands. While the disk group DATA is normal redundancy, the controlfile is high redundancy (the default redundancy for a controlfile in ASM).

Conclusion

The find_block.pl is a Perl script that constructs the dd or the kfed command to extract a block from a file in ASM. In most cases we want to extract a block from a datafile, but the script can be used to extract a block from a controlfile, redo log, or any other file in ASM.

If the file is external redundancy, the script will generate a single command that can be used to extract the block from the ASM disk.

If the file is normal redundancy, the script will generate two commands that can be used to extract the (copies of the same) block from two different ASM disks. This can be handy, for example, in cases where a corruption is reported against one of the blocks and for some reason the ASM cannot repair it.

If the file is high redundancy, the script will generate three commands.

To use the script you don't have to know the file redundancy, the block size or any other file attribute. All that is required is the file name and the block number.

ASM files number 12 and 254

The staleness directory (ASM file number 12) contains metadata to map the slots in the staleness registry to particular disks and ASM clients. The staleness registry (ASM file number 254) tracks allocation units that become stale while the disks are offline. This applies to normal and high redundancy disk groups with the attribute COMPATIBLE.RDBMS set to 11.1 or higher. The staleness metadata is created when needed, and grows to accommodate additional offline disks.

When a disk goes offline, each RDBMS instance gets a slot in the staleness registry for that disk. This slot has a bit for each allocation unit in the offline disk. When an RDBMS instance I/O write is targeted for an offline disk, that instance sets the corresponding bit in the staleness registry.

When a disk is brought back online, ASM copies the allocation units that have the staleness registry bit set from the mirrored extents. Because only the allocation units that might have changed while the disk was offline are updated, bringing a disk online is more efficient than adding the disk back, which would be required if it had been dropped instead of just taken offline.
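
For completeness, the disk is brought back online with the standard ALTER DISKGROUP syntax. A sketch, using the disk group and disk from the example further below:

SQL> alter diskgroup RECO online disk ASMDISK6;

or, to bring online everything that is offline in that disk group:

SQL> alter diskgroup RECO online all;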

No stale disks

The staleness metadata structures are created as needed, which means the staleness directory and registry do not exist when all disks are online.

SQL> SELECT g.name "Disk group",
 g.group_number "Group#",
 d.disk_number "Disk#",
 d.name "Disk",
 d.mode_status "Disk status"
FROM v$asm_disk d, v$asm_diskgroup g
WHERE g.group_number=d.group_number and g.group_number<>0
ORDER BY 1, 2, 3;

Disk group       Group#      Disk# Disk         Disk status
------------ ---------- ---------- ------------ ------------
DATA                  1          0 ASMDISK1     ONLINE
                                 1 ASMDISK2     ONLINE
                                 2 ASMDISK3     ONLINE
RECO                  2          0 ASMDISK4     ONLINE
                                 1 ASMDISK5     ONLINE
                                 2 ASMDISK6     ONLINE

SQL> SELECT x.number_kffxp "File#",
 x.disk_kffxp "Disk#",
 x.xnum_kffxp "Extent",
 x.au_kffxp "AU",
 d.name "Disk name"
FROM x$kffxp x, v$asm_disk_stat d
WHERE x.group_kffxp=d.group_number
 and x.disk_kffxp=d.disk_number
 and x.number_kffxp in (12, 254)
ORDER BY 1, 2;

no rows selected

Stale disks

Staleness information will be created when a disk goes offline, but only when there are I/O writes intended for offline disks.

In the following example, I will offline the disk manually, with the ALTER DISKGROUP OFFLINE DISK command. But as far as staleness metadata is concerned, it will be created irrespective of how and why a disk goes offline.

SQL> alter diskgroup RECO offline disk ASMDISK6;

Diskgroup altered.

SQL> SELECT g.name "Disk group",
 g.group_number "Group#",
 d.disk_number "Disk#",
 d.name "Disk",
 d.mode_status "Disk status"
FROM v$asm_disk d, v$asm_diskgroup g
WHERE g.group_number=d.group_number and g.group_number=2
ORDER BY 1, 2, 3;

Disk group       Group#      Disk# Disk         Disk status
------------ ---------- ---------- ------------ ------------
RECO                  2          0 ASMDISK4     ONLINE
                                 1 ASMDISK5     ONLINE
                                 2 ASMDISK6     OFFLINE

The database keeps writing to this disk group, and after a while we see the staleness directory and the staleness registry created for this disk group:

SQL> SELECT x.number_kffxp "File#",
 x.disk_kffxp "Disk#",
 x.xnum_kffxp "Extent",
 x.au_kffxp "AU",
 d.name "Disk name"
FROM x$kffxp x, v$asm_disk_stat d
WHERE x.group_kffxp=d.group_number
 and x.disk_kffxp=d.disk_number
 and d.group_number=2
 and x.number_kffxp in (12, 254)
ORDER BY 1, 2;

     File#      Disk#     Extent         AU Disk name
---------- ---------- ---------- ---------- ------------------------------
        12          0          0         86 ASMDISK4
                    1          0        101 ASMDISK5
                    2          0 4294967294 ASMDISK6
       254          0          0         85 ASMDISK4
                    1          0        100 ASMDISK5
                    2          0 4294967294 ASMDISK6

Look inside

There is not much to see in the actual metadata. Even kfed struggles to recognise these types of metadata blocks :)

$ kfed read /dev/oracleasm/disks/ASMDISK4 aun=86 | more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                           21 ; 0x002: *** Unknown Enum ***
...
kffdnd.bnode.incarn:                  1 ; 0x000: A=1 NUMM=0x0
kffdnd.bnode.frlist.number:  4294967295 ; 0x004: 0xffffffff
kffdnd.bnode.frlist.incarn:           0 ; 0x008: A=0 NUMM=0x0
kffdnd.overfl.number:        4294967295 ; 0x00c: 0xffffffff
kffdnd.overfl.incarn:                 0 ; 0x010: A=0 NUMM=0x0
kffdnd.parent.number:                 0 ; 0x014: 0x00000000
kffdnd.parent.incarn:                 1 ; 0x018: A=1 NUMM=0x0
kffdnd.fstblk.number:                 0 ; 0x01c: 0x00000000
kffdnd.fstblk.incarn:                 1 ; 0x020: A=1 NUMM=0x0
kfdsde.entry.incarn:                  1 ; 0x024: A=1 NUMM=0x0
kfdsde.entry.hash:                    0 ; 0x028: 0x00000000
kfdsde.entry.refer.number:   4294967295 ; 0x02c: 0xffffffff
kfdsde.entry.refer.incarn:            0 ; 0x030: A=0 NUMM=0x0
kfdsde.cid:                       +ASMR ; 0x034: length=5
kfdsde.indlen:                        1 ; 0x074: 0x0001
kfdsde.flags:                         0 ; 0x076: 0x0000
kfdsde.spare1:                        0 ; 0x078: 0x00000000
kfdsde.spare2:                        0 ; 0x07c: 0x00000000
kfdsde.indices[0]:                    0 ; 0x080: 0x00000000
kfdsde.indices[1]:                    0 ; 0x084: 0x00000000
kfdsde.indices[2]:                    0 ; 0x088: 0x00000000
...

$ kfed read /dev/oracleasm/disks/ASMDISK4 aun=85 | more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                           20 ; 0x002: *** Unknown Enum ***
...
kfdsHdrB.clientId:           1297301881 ; 0x000: 0x4d534179
kfdsHdrB.incarn:                      0 ; 0x004: 0x00000000
kfdsHdrB.dskNum:                      2 ; 0x008: 0x0002
kfdsHdrB.ub2spare:                    0 ; 0x00a: 0x0000
ub1[0]:                               0 ; 0x00c: 0x00
ub1[1]:                               0 ; 0x00d: 0x00
ub1[2]:                               0 ; 0x00e: 0x00
ub1[3]:                               0 ; 0x00f: 0x00
ub1[4]:                               0 ; 0x010: 0x00
ub1[5]:                               0 ; 0x011: 0x00
ub1[6]:                               0 ; 0x012: 0x00
ub1[7]:                              16 ; 0x013: 0x10
ub1[8]:                               0 ; 0x014: 0x00
...

Not much to see, as these are just bitmaps.

Conclusion

The staleness directory and the staleness registry are supporting metadata structures for the disk offline and fast mirror resync feature introduced in ASM version 11.1. The staleness directory contains metadata to map the slots in the staleness registry to particular disks and ASM clients. The staleness registry tracks allocation units that become stale while the disks are offline. This feature is relevant to normal and high redundancy disk groups only.

ASM in Exadata

ASM is a critical component of the Exadata software stack. It is also a bit different - compared to non-Exadata environments. It still manages your disk groups, but builds those with grid disks. It still takes care of disk errors, but also handles predictive disk failures. It doesn't like external redundancy, but it makes the disk group smart scan capable. Let's have a closer look.

Grid disks

In Exadata the ASM disks live on the storage cells and are presented to the compute nodes (where the ASM instances run) via the Oracle proprietary iDB protocol. Each storage cell has 12 hard disks and 16 flash disks. During Exadata deployment, grid disks are created on those 12 hard disks. Flash disks are used for the flash cache and the flash (redo) log, so grid disks are normally not created on flash disks.

Grid disks are not exposed to the Operating System, so only database instances, ASM and related utilities that speak iDB can see them. The ASM discovery tool kfod is one such utility. Here is an example of kfod discovering grid disks in one Exadata environment:

$ kfod disks=all
-----------------------------------------------------------------
 Disk          Size Path                           User     Group
=================================================================

   1:     433152 Mb o/192.168.10.9/DATA_CD_00_exacell01  
   2:     433152 Mb o/192.168.10.9/DATA_CD_01_exacell01  
   3:     433152 Mb o/192.168.10.9/DATA_CD_02_exacell01  
  ...
  13:      29824 Mb o/192.168.10.9/DBFS_DG_CD_02_exacell01
  14:      29824 Mb o/192.168.10.9/DBFS_DG_CD_03_exacell01
  15:      29824 Mb o/192.168.10.9/DBFS_DG_CD_04_exacell01
  ...
  23:     108224 Mb o/192.168.10.9/RECO_CD_00_exacell01  
  24:     108224 Mb o/192.168.10.9/RECO_CD_01_exacell01  
  25:     108224 Mb o/192.168.10.9/RECO_CD_02_exacell01  
  ...
 474:     108224 Mb o/192.168.10.22/RECO_CD_09_exacell14  
 475:     108224 Mb o/192.168.10.22/RECO_CD_10_exacell14  
 476:     108224 Mb o/192.168.10.22/RECO_CD_11_exacell14  

-----------------------------------------------------------------
ORACLE_SID ORACLE_HOME
=================================================================
  +ASM1 /u01/app/11.2.0.3/grid
  +ASM2 /u01/app/11.2.0.3/grid
  +ASM3 /u01/app/11.2.0.3/grid
  ...
  +ASM8 /u01/app/11.2.0.3/grid
$

Note that grid disks are prefixed with either DATA, RECO or DBFS_DG. Those are ASM disk group names in this environment. Each grid disk name ends with the storage cell name. It is also important to note that disks with the same prefix have the same size. The above example is from a full rack - hence 14 storage cells and 8 ASM instances.

ASM_DISKSTRING

In Exadata, ASM_DISKSTRING='o/*/*'. That tells ASM that it is running on an Exadata compute node and that it should expect grid disks.

$ sqlplus / as sysasm
SQL> show parameter asm_diskstring
NAME           TYPE   VALUE
-------------- ------ -----
asm_diskstring string o/*/*

Automatic failgroups

There are no external redundancy disk groups in Exadata - you have a choice of either normal or high redundancy. When creating disk groups, ASM automatically puts all grid disks from the same storage cell into the same failgroup. The failgroup is then named after the storage cell.

Here is an example of creating a disk group in an Exadata environment (note how that grid disk prefix comes in handy):

SQL> create diskgroup RECO
disk 'o/*/RECO*'
attribute
'COMPATIBLE.ASM'='11.2.0.0.0',
'COMPATIBLE.RDBMS'='11.2.0.0.0',
'CELL.SMART_SCAN_CAPABLE'='TRUE';

Once the disk group is created we can check the disk and failgroup names:

SQL> select name, failgroup, path from v$asm_disk_stat where name like 'RECO%';

NAME                 FAILGROUP PATH
-------------------- --------- -----------------------------------
RECO_CD_08_EXACELL01 EXACELL01 o/192.168.10.3/RECO_CD_08_exacell01
RECO_CD_07_EXACELL01 EXACELL01 o/192.168.10.3/RECO_CD_07_exacell01
RECO_CD_01_EXACELL01 EXACELL01 o/192.168.10.3/RECO_CD_01_exacell01
...
RECO_CD_00_EXACELL02 EXACELL02 o/192.168.10.4/RECO_CD_00_exacell02
RECO_CD_05_EXACELL02 EXACELL02 o/192.168.10.4/RECO_CD_05_exacell02
RECO_CD_04_EXACELL02 EXACELL02 o/192.168.10.4/RECO_CD_04_exacell02
...

SQL>

Note that we did not specify the failgroup names in the CREATE DISKGROUP statement. ASM has automatically put grid disks from the same storage cell in the same failgroup.

cellip.ora


The cellip.ora is the configuration file, on every database server, that tells ASM instances which cells are available to the cluster.

Here is the content of a typical cellip.ora file for a quarter rack system:

$ cat /etc/oracle/cell/network-config/cellip.ora
cell="192.168.10.3"
cell="192.168.10.4"
cell="192.168.10.5"

Now that we see what is in the cellip.ora, the grid disk paths in the examples above should make more sense.

Disk group attributes

The following attributes and their values are recommended in Exadata environments (a query to check them follows the list):
  • COMPATIBLE.ASM - Should be set to the ASM software version in use.
  • COMPATIBLE.RDBMS - Should be set to the database software version in use.
  • CELL.SMART_SCAN_CAPABLE - Has to be set to TRUE. This attribute/value is actually mandatory in Exadata.
  • AU_SIZE - Should be set to 4M. This is the default value in recent ASM versions for Exadata environments.
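
To check what is actually set on an existing disk group, here is a quick sketch against V$ASM_ATTRIBUTE (group number 1 is just an example):

SQL> select NAME, VALUE
from V$ASM_ATTRIBUTE
where GROUP_NUMBER=1
and NAME in ('compatible.asm', 'compatible.rdbms', 'cell.smart_scan_capable', 'au_size');
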
Initialization parameters

The following recommendations are for ASM version 11.2.0.3.


  • CLUSTER_INTERCONNECTS - Bondib0 IP address for X2-2. Colon-delimited Bondib* IP addresses for X2-8.
  • ASM_POWER_LIMIT - 1 for a quarter rack, 2 for all other racks.
  • SGA_TARGET - 1250 MB.
  • PGA_AGGREGATE_TARGET - 400 MB.
  • MEMORY_TARGET - 0.
  • MEMORY_MAX_TARGET - 0.
  • PROCESSES - For fewer than 10 database instances per node: 50*(#db instances per node + 1). For 10 or more instances per node: [50*MIN(#db instances per node + 1, 11)] + [10*MAX(#db instances per node - 10, 0)].
  • USE_LARGE_PAGES - ONLY.
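
As a sketch, two of these could be set in the ASM instance like this (quarter rack value assumed for ASM_POWER_LIMIT; USE_LARGE_PAGES is static, so it takes effect after an instance restart):

SQL> alter system set ASM_POWER_LIMIT=1 scope=both sid='*';
SQL> alter system set USE_LARGE_PAGES='ONLY' scope=spfile sid='*';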

Voting disks and disk group redundancy

The default location for the voting disks in Exadata is the ASM disk group DBFS_DG. That disk group can be either normal or high redundancy, except in a quarter rack where it has to be normal redundancy.

This is because of the voting disks requirement for the minimal number of failgroups in a given ASM disk group. If we put voting disks in a normal redundancy disk group, that disk group has to have at least 3 failgroups. If we put voting disks in a high redundancy disk group, that disk group has to have at least 5 failgroups.

In a quarter rack, where we have only 3 storage cells, all disk groups can have at most 3 failgroups. While we can create a high redundancy disk group with 3 storage cells, voting disks cannot go into that disk group as it does not have 5 failgroups.
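
You can see which ASM disks actually hold the voting files with a simple query (VOTING_FILE is a standard column in V$ASM_DISK):

SQL> select NAME, FAILGROUP, VOTING_FILE
from V$ASM_DISK
where VOTING_FILE='Y';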

XDMG and XDWK background processes

These two processes run in the ASM instances on the compute nodes. XDMG monitors all configured Exadata cells for storage state changes and performs the required tasks for such events. Its primary role is to watch for inaccessible disks and to initiate the disk online operations when they become accessible again. Those operations are then handled by XDWK.

XDWK gets started when asynchronous actions such as disk ONLINE, DROP and ADD are requested by XDMG. After a 5 minute period of inactivity, this process will shut itself down.

The Exadata Server, which runs on the storage cells, monitors disk health and performance. If a disk's performance degrades, the Exadata Server can put it into proactive failure mode. It also monitors for predictive failures based on the disk's SMART (Self-Monitoring, Analysis and Reporting Technology) data. In both cases, the Exadata Server notifies XDMG to take those disks offline.

When a faulty disk is replaced on the storage cell, the Exadata Server will recreate all grid disks on the new disk. It will then notify XDMG to bring those grid disks online, or to add them back to their disk groups in case they were already dropped.

The diskmon

The master diskmon process (diskmon.bin) can be seen running in all Grid Infrastructure installs, but it is only in Exadata that it actually does any work. On every compute node there will be one master diskmon process and one DSKM (slave diskmon) process per Oracle instance (including ASM). Here is an example from one compute node:

# ps -ef | egrep "diskmon|dskm" | grep -v grep
oracle    3205     1  0 Mar16 ?        00:01:18 ora_dskm_ONE2
oracle   10755     1  0 Mar16 ?        00:32:19 /u01/app/11.2.0.3/grid/bin/diskmon.bin -d -f
oracle   17292     1  0 Mar16 ?        00:01:17 asm_dskm_+ASM2
oracle   24388     1  0 Mar28 ?        00:00:21 ora_dskm_TWO2
oracle   27962     1  0 Mar27 ?        00:00:24 ora_dskm_THREE2
#

In Exadata, the diskmon is responsible for:
  • Handling of storage cell failures and I/O fencing
  • Monitoring of the Exadata Server state on all storage cells in the cluster (heartbeat)
  • Broadcasting intra-database IORM (I/O Resource Manager) plans from databases to storage cells
  • Monitoring of the control messages from database and ASM instances to storage cells
  • Communicating with other diskmons in the cluster

ACFS

The ACFS (ASM Cluster File System) is supported in Exadata environments starting with ASM version 12.1.0.2. Alternatives to the ACFS are the DBFS (Database based File System) and the NFS (Network File System). Many Exadata customers have an Oracle ZFS Appliance that can provide high performance, InfiniBand connected NFS storage.

Conclusion

There are quite a few extra features and differences in ASM in Exadata compared to non-Exadata environments. Most of them are about storage cells and grid disks, and some are about tuning ASM for the extreme Exadata performance.

Exadata motherboard replacement

This photo story is about a motherboard replacement in an Exadata X2-2 database server (Sun Fire X4170 M2).

While setting up a half rack Exadata system, we discovered a problem with the motherboard in database server 4, so we got ourselves a new one. Even for an internal system, I had to log an SR, get our hardware team to review the diagnostics and have them confirm that a new motherboard was required. I had the full customer experience, being both the customer and the Support Engineer! Anyway, a replacement motherboard arrived the next day.

We stopped the clusterware on that database server and powered it down. It was time to take the new motherboard to that noisy computer room.

We disconnected both power cords from the server, but we left all other cables plugged in. We then (slowly, very slowly) pulled the server out of the rack (it's on rails, so it just slides out). In field engineer lingo, we 'extended the server to the maintenance position'. All cables were still plugged in, but the cable management arm took good care of them. From the photo below, it can be seen that this was a half rack system, with all other database servers and storage cells up and running.

Once the server was fully extended, we removed the top cover and disconnected the cables (Ethernet, InfiniBand and KVM). We then removed the PCIe cards (RAID controller, dual port 10Gb Ethernet and dual port InfiniBand HCA). In the photo below, the PCIe cards and memory modules have already been removed, so the motherboard is fully exposed.

The motherboard was then taken out of the chassis. In the photo below we see the dual power supply (top left), two CPU heat sinks (removed from the CPUs and resting on top of the power supply), a row of fans (middle) and the disk drives (right). Well, we cannot see the disk drives as they are covered, but they are there under those tools.

It was then time to transfer the CPUs and memory modules from the old motherboard to the new one.

The memory modules are easy to put in as they just click in. The CPUs are tricky with all their pins, so they need to be put in very carefully.

After that the new motherboard was ready to go into the server. In the photo below we can see two CPUs, without the heat sinks, and two rows of memory modules. This server had 'only' 96 GB of RAM, with plastic fillers in 6 slots. There is room for a total of 144 GB of RAM on that motherboard.

The motherboard is shipped with thermal paste, so now was the time to apply that paste on top of the CPUs and put the heat sinks back on top of them.

The photo below shows all parts back on the motherboard. Note that the three PCIe cards are not plugged into the motherboard directly. They are plugged into little riser boards, and those are plugged into the motherboard.

Plug back in all the cables (Ethernet, InfiniBand and KVM), put the cover on and slide the server back into the rack. Connect the power cables and turn the server on.

Our server came up fine and the only thing we had to do was set up ILOM. That is done by running /opt/oracle.cellos/ipconf.pl and specifying the ILOM name, IP address and other network details. We also had to reset the ILOM password.

Finally, we started up the clusterware and all services came up fine. All this was done with only a single database server out of action (for about an hour and a half) with the clusterware and database(s) running on the remaining three nodes in the cluster.

When will my rebalance complete

This has to be one of the top ASM questions people ask me. But if you expect me to respond with a number of minutes, you will be disappointed. After all, ASM has given you an estimate, and you still want to know exactly when that rebalance is going to finish. Instead, I will show you how to check if the rebalance is actually progressing, what phase it is in, and if there is a reason for concern.

Understanding the rebalance

As explained in the rebalancing act, the rebalance operation has three phases - planning, extents relocation and compacting. As far as the overall time to complete is concerned, the planning phase time is insignificant, so there is no need to worry about it. The extents relocation phase will take most of the time, so the main focus will be on that. I will also show what is going on during the compacting phase.

It is important to know why the rebalance is running. If you are adding a new disk, say to increase the available disk group space, it doesn't really matter how long it will take for the rebalance to complete. OK maybe it does, if your database is hung because you ran out of space in your archive log destination. Similarly if you are resizing or dropping disk(s), to adjust the disk group space, you are generally not concerned with the time it takes for the rebalance to complete.

But if a disk has failed and ASM has initiated rebalance, there may be legitimate reason for concern. If your disk group is normal redundancy AND if another disk fails AND it's the partner of that disk that has already failed, your disk group will be dismounted, all your databases that use that disk group will crash and you may lose data. In such cases I understand that you want to know when that rebalance will complete. Actually, you want to see the relocation phase completed, as once it does, all your data is fully redundant again.

Extents relocation

To have a closer look at the extents relocation phase, I drop one of the disks with the default rebalance power:

SQL> show parameter power

NAME                                 TYPE        VALUE
------------------------------------ ----------- ----------------
asm_power_limit                      integer     1

SQL> set time on
16:40:57 SQL> alter diskgroup DATA1 drop disk DATA1_CD_06_CELL06;

Diskgroup altered.

Initial estimated time to complete is 26 minutes:

16:41:21 SQL> select INST_ID, OPERATION, STATE, POWER, SOFAR, EST_WORK, EST_RATE, EST_MINUTES from GV$ASM_OPERATION where GROUP_NUMBER=1;

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         3 REBAL WAIT          1
         2 REBAL RUN           1        516      53736       2012          26
         4 REBAL WAIT          1

About 10 minutes into the rebalance, the estimate is 24 minutes:

16:50:25 SQL> /

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         3 REBAL WAIT          1
         2 REBAL RUN           1      19235      72210       2124          24
         4 REBAL WAIT          1

While that EST_MINUTES doesn't give me much confidence, I see that the SOFAR (number of allocation units moved so far) is going up, which is a good sign.
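
The EST_MINUTES value is derived from the other columns, so a rough cross-check can be done by hand. A sketch (EST_RATE is the estimated number of allocation units moved per minute):

SQL> select INST_ID, SOFAR, EST_WORK, EST_RATE,
  round((EST_WORK - SOFAR)/nullif(EST_RATE, 0)) "Minutes left (approx)"
from GV$ASM_OPERATION
where GROUP_NUMBER=1 and STATE='RUN';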

ASM alert log shows the time of the drop disk, the OS process ID of the ARB0 doing all the work, and most importantly - that there are no errors:

Wed Jul 11 16:41:15 2012
SQL> alter diskgroup DATA1 drop disk DATA1_CD_06_CELL06
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=1
...
NOTE: starting rebalance of group 1/0x6ecaf3e6 (DATA1) at power 1
Starting background process ARB0
Wed Jul 11 16:41:24 2012
ARB0 started with pid=41, OS id=58591
NOTE: assigning ARB0 to group 1/0x6ecaf3e6 (DATA1) with 1 parallel I/O
NOTE: F1X0 copy 3 relocating from 0:2 to 55:35379 for diskgroup 1 (DATA1)
...

ARB0 trace file should show which file extents are being relocated. It does, and that is how I know that ARB0 is doing what it is supposed to do:

$ tail -f /u01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_arb0_58591.trc
...
ARB0 relocating file +DATA1.282.788356359 (120 entries)
*** 2012-07-11 16:48:44.808
ARB0 relocating file +DATA1.283.788356383 (120 entries)
...
*** 2012-07-11 17:13:11.761
ARB0 relocating file +DATA1.316.788357201 (120 entries)
*** 2012-07-11 17:13:16.326
ARB0 relocating file +DATA1.316.788357201 (120 entries)
...

Note that there may be a lot of arb0 trace files in the trace directory, which is why we need to know the OS process ID of the ARB0 actually doing the rebalance. That information is in the alert log of the ASM instance performing the rebalance.
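
Instead of digging through the alert log, the ARB0 OS process ID can also be picked up from the instance itself; a minimal sketch against GV$PROCESS:

SQL> select INST_ID, PNAME, SPID
from GV$PROCESS
where PNAME like 'ARB%';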

I can also look at the pstack of the ARB0 process to see what is going on. It does show me that ASM is relocating extents (key functions on the stack being kfgbRebalExecute - kfdaExecute - kffRelocate):

# pstack 58591
#0  0x0000003957ccb6ef in poll () from /lib64/libc.so.6
...
#9  0x0000000003d711e0 in kfk_reap_oss_async_io ()
#10 0x0000000003d70c17 in kfk_reap_ios_from_subsys ()
#11 0x0000000000aea50e in kfk_reap_ios ()
#12 0x0000000003d702ae in kfk_io1 ()
#13 0x0000000003d6fe54 in kfkRequest ()
#14 0x0000000003d76540 in kfk_transitIO ()
#15 0x0000000003cd482b in kffRelocateWait ()
#16 0x0000000003cfa190 in kffRelocate ()
#17 0x0000000003c7ba16 in kfdaExecute ()
#18 0x0000000003d4beaa in kfgbRebalExecute ()
#19 0x0000000003d39627 in kfgbDriver ()
#20 0x00000000020e8d23 in ksbabs ()
#21 0x0000000003d4faae in kfgbRun ()
#22 0x00000000020ed95d in ksbrdp ()
#23 0x0000000002322343 in opirip ()
#24 0x0000000001618571 in opidrv ()
#25 0x0000000001c13be7 in sou2o ()
#26 0x000000000083ceba in opimai_real ()
#27 0x0000000001c19b58 in ssthrdmain ()
#28 0x000000000083cda1 in main ()

After about 35 minutes the EST_MINUTES drops to 0:

17:16:54 SQL> /

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         2 REBAL RUN           1      74581      75825       2129           0
         3 REBAL WAIT          1
         4 REBAL WAIT          1

And soon after that, the ASM alert log shows:
  • Disk emptied
  • Disk header erased
  • PST update completed successfully
  • Disk closed
  • Rebalance completed
Wed Jul 11 17:17:32 2012
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=1
Wed Jul 11 17:17:41 2012
GMON updating for reconfiguration, group 1 at 20 for pid 38, osid 93832
NOTE: group 1 PST updated.
SUCCESS: grp 1 disk DATA1_CD_06_CELL06 emptied
NOTE: erasing header on grp 1 disk DATA1_CD_06_CELL06
NOTE: process _x000_+asm2 (93832) initiating offline of disk 0.3916039210 (DATA1_CD_06_CELL06) with mask 0x7e in group 1
NOTE: initiating PST update: grp = 1, dsk = 0/0xe96a042a, mask = 0x6a, op = clear
GMON updating disk modes for group 1 at 21 for pid 38, osid 93832
NOTE: PST update grp = 1 completed successfully
NOTE: initiating PST update: grp = 1, dsk = 0/0xe96a042a, mask = 0x7e, op = clear
GMON updating disk modes for group 1 at 22 for pid 38, osid 93832
NOTE: cache closing disk 0 of grp 1: DATA1_CD_06_CELL06
NOTE: PST update grp = 1 completed successfully
GMON updating for reconfiguration, group 1 at 23 for pid 38, osid 93832
NOTE: cache closing disk 0 of grp 1: (not open) DATA1_CD_06_CELL06
NOTE: group 1 PST updated.
Wed Jul 11 17:17:41 2012
NOTE: membership refresh pending for group 1/0x6ecaf3e6 (DATA1)
GMON querying group 1 at 24 for pid 19, osid 38421
GMON querying group 1 at 25 for pid 19, osid 38421
NOTE: Disk  in mode 0x8 marked for de-assignment
SUCCESS: refreshed membership for 1/0x6ecaf3e6 (DATA1)
NOTE: stopping process ARB0
SUCCESS: rebalance completed for group 1/0x6ecaf3e6 (DATA1)
NOTE: Attempting voting file refresh on diskgroup DATA1

So the estimated time was 26 minutes and the rebalance actually took about 36 minutes (in this particular case the compacting took less than a minute, so I have ignored it). That is why it is more important to understand what is going on than to know when the rebalance will complete.

Note that the estimated time may also be increasing. If the system is under heavy load, the rebalance will take more time - especially with the rebalance power 1. For a large disk group (many TB) and large number of files, the rebalance can take hours and possibly days.

If you want to get an idea how long a disk drop will take in your environment, you need to test it. Just drop one of the disks while your system is under normal/typical load. Your data is fully redundant during such a disk drop, so you are not exposed to a disk group dismount in case its partner disk fails during the rebalance.

Compacting

In another example, to look at the compacting phase, I add the same disk back, with rebalance power 10:

17:26:48 SQL> alter diskgroup DATA1 add disk 'o/*/DATA1_CD_06_cell06' rebalance power 10;

Diskgroup altered.

Initial estimated time to complete is 6 minutes:

17:27:22 SQL> select INST_ID, OPERATION, STATE, POWER, SOFAR, EST_WORK, EST_RATE, EST_MINUTES from GV$ASM_OPERATION where GROUP_NUMBER=1;

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         2 REBAL RUN          10        489      53851       7920           6
         3 REBAL WAIT         10
         4 REBAL WAIT         10

After about 10 minutes, the EST_MINUTES drops to 0:

17:39:05 SQL> /

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         3 REBAL WAIT         10
         2 REBAL RUN          10      92407      97874       8716           0
         4 REBAL WAIT         10

And I see the following in the ASM alert log

Wed Jul 11 17:39:49 2012
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=1
Wed Jul 11 17:39:58 2012
GMON updating for reconfiguration, group 1 at 31 for pid 43, osid 115117
NOTE: group 1 PST updated.
Wed Jul 11 17:39:58 2012
NOTE: membership refresh pending for group 1/0x6ecaf3e6 (DATA1)
GMON querying group 1 at 32 for pid 19, osid 38421
SUCCESS: refreshed membership for 1/0x6ecaf3e6 (DATA1)
NOTE: Attempting voting file refresh on diskgroup DATA1

That means ASM has completed the second phase of the rebalance and is compacting now. If that is true, the pstack should show kfdCompact() function. Indeed it does:

# pstack 103326
#0  0x0000003957ccb6ef in poll () from /lib64/libc.so.6
...
#9  0x0000000003d711e0 in kfk_reap_oss_async_io ()
#10 0x0000000003d70c17 in kfk_reap_ios_from_subsys ()
#11 0x0000000000aea50e in kfk_reap_ios ()
#12 0x0000000003d702ae in kfk_io1 ()
#13 0x0000000003d6fe54 in kfkRequest ()
#14 0x0000000003d76540 in kfk_transitIO ()
#15 0x0000000003cd482b in kffRelocateWait ()
#16 0x0000000003cfa190 in kffRelocate ()
#17 0x0000000003c7ba16 in kfdaExecute ()
#18 0x0000000003c4b737 in kfdCompact ()
#19 0x0000000003c4c6d0 in kfdExecute ()
#20 0x0000000003d4bf0e in kfgbRebalExecute ()
#21 0x0000000003d39627 in kfgbDriver ()
#22 0x00000000020e8d23 in ksbabs ()
#23 0x0000000003d4faae in kfgbRun ()
#24 0x00000000020ed95d in ksbrdp ()
#25 0x0000000002322343 in opirip ()
#26 0x0000000001618571 in opidrv ()
#27 0x0000000001c13be7 in sou2o ()
#28 0x000000000083ceba in opimai_real ()
#29 0x0000000001c19b58 in ssthrdmain ()
#30 0x000000000083cda1 in main ()

The tail on the ARB0 trace file now shows relocating just 1 entry at a time (another sign of compacting):

$ tail -f /u01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_arb0_103326.trc
ARB0 relocating file +DATA1.321.788357323 (1 entries)
ARB0 relocating file +DATA1.321.788357323 (1 entries)
ARB0 relocating file +DATA1.321.788357323 (1 entries)
...

The V$ASM_OPERATION keeps showing EST_MINUTES=0 (compacting):

17:42:39 SQL> /

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         3 REBAL WAIT         10
         4 REBAL WAIT         10
         2 REBAL RUN          10      98271      98305       7919           0

The X$KFGMG shows REBALST_KFGMG=2 (compacting):

17:42:50 SQL> select NUMBER_KFGMG, OP_KFGMG, ACTUAL_KFGMG, REBALST_KFGMG from X$KFGMG;

NUMBER_KFGMG   OP_KFGMG ACTUAL_KFGMG REBALST_KFGMG
------------ ---------- ------------ -------------
           1          1           10             2

Once the compacting phase completes, the alert log shows "stopping process ARB0" and "rebalance completed":

Wed Jul 11 17:43:48 2012
NOTE: stopping process ARB0
SUCCESS: rebalance completed for group 1/0x6ecaf3e6 (DATA1)

In this case, the extents relocation took about 12 minutes and the compacting took about 4 minutes.

The compacting phase can actually take a significant amount of time. In one case I have seen the extents relocation run for 60 minutes and the compacting after that took another 30 minutes. But it doesn't really matter how long the compacting takes, because as soon as the second phase of the rebalance (extents relocation) completes, all data is fully redundant and we are not exposed to disk group dismount due to a partner disk failure.

Changing the power

Rebalance power can be changed dynamically, i.e. during the rebalance, so if your rebalance with the default power is 'too slow', you can increase it. How much? Well, do you understand your I/O load, your I/O throughput and most importantly your limits? If not, increase the power to 5 (just run 'ALTER DISKGROUP ... REBALANCE POWER 5;') and see if it makes a difference. Give it 10-15 minutes, before you jump to conclusions. Should you go higher? Again, as long as you are not adversely impacting your database I/O performance, you can keep increasing the power. But I haven't seen much improvement beyond power 30.
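
The power change itself is just the statement quoted above; as a sketch, for the disk group used in this example, followed by another look at the estimate:

SQL> alter diskgroup DATA1 rebalance power 5;

SQL> select INST_ID, OPERATION, STATE, POWER, SOFAR, EST_WORK, EST_RATE, EST_MINUTES
from GV$ASM_OPERATION
where GROUP_NUMBER=1;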

The testing is the key here. You really need to test this under your regular load and in your production environment. There is no point testing with no databases running, or on a system that runs off a different storage system.

Where is my data

Sometimes we want to know exactly where a particular database block is - on which ASM disk, in which allocation unit on that disk, and in which block of that allocation unit. In this post I will show how to work that out.

Database instance

In the first part of this exercise I am logged into the database instance. Let's create a tablespace first.

SQL> create tablespace T1 datafile '+DATA';

Tablespace created.

SQL> select f.FILE#, f.NAME "File", t.NAME "Tablespace"
from V$DATAFILE f, V$TABLESPACE t
where t.NAME='T1' and f.TS# = t.TS#;

FILE# File                               Tablespace
----- ---------------------------------- ----------
   6  +DATA/br/datafile/t1.272.797809075 T1

SQL>

Note the ASM file number is 272.

Let's now create a table and insert some data into it

SQL> create table TAB1 (n number, name varchar2(16)) tablespace T1;

Table created.

SQL> insert into TAB1 values (1, 'CAT');

1 row created.

SQL> commit;

Commit complete.

SQL>

Get the block number.

SQL> select ROWID, NAME from TAB1;

ROWID              NAME
------------------ ----
AAASxxAAGAAAACHAAA CAT

SQL> select DBMS_ROWID.ROWID_BLOCK_NUMBER('AAASxxAAGAAAACHAAA') "Block number" from DUAL;

Block number
------------
         135

SQL>

Get the block size for the datafile.

SQL> select BLOCK_SIZE from V$DATAFILE where FILE#=6;

BLOCK_SIZE
----------
      8192

SQL>

From the above I see that the data is in block 135 and that the block size is 8KB.

ASM instance

I now connect to the ASM instance and first check the extent distributions for ASM datafile 272.

SQL> select GROUP_NUMBER from V$ASM_DISKGROUP where NAME='DATA';

GROUP_NUMBER
------------
           1

SQL> select PXN_KFFXP, -- physical extent number
XNUM_KFFXP,            -- virtual extent number
DISK_KFFXP,            -- disk number
AU_KFFXP               -- allocation unit number
from X$KFFXP
where NUMBER_KFFXP=272 -- ASM file 272
AND GROUP_KFFXP=1      -- group number 1
order by 1;

PXN_KFFXP  XNUM_KFFXP DISK_KFFXP   AU_KFFXP
---------- ---------- ---------- ----------
         0          0          0       1175
         1          0          3       1170
         2          1          3       1175
         3          1          2       1179
         4          2          1       1175
...

SQL>

As expected, the file extents are spread over all disks and each (physical) extent is mirrored, as this file is normal redundancy. Note that I said the file is normal redundancy. By default the file inherits the disk group redundancy. The controlfile is an exception, as it gets created as high redundancy, even in the normal redundancy disk group - if the disk group has at least three failgroups.

I also need to know the ASM allocation unit size for this disk group.

SQL> select VALUE from V$ASM_ATTRIBUTE where NAME='au_size' and GROUP_NUMBER=1;

VALUE
-------
1048576

SQL>

The allocation unit size is 1MB. Note that each disk group can have a different allocation unit size.

Where is my block

I know my data is in block 135 of ASM file 272. With the block size of 8K, each allocation unit can hold 128 blocks (1MB/8KB=128). That means block 135 is 7 blocks into the second virtual extent (135-128=7). The second virtual extent consists of allocation unit 1175 on disk 3 and allocation unit 1179 on disk 2, as per the select from X$KFFXP.
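
The same arithmetic can be done in SQL, if only to double check. A sketch with the numbers from this example (8192-byte blocks, 1048576-byte allocation units, block 135):

SQL> select trunc(135/(1048576/8192)) "Virtual extent",
  mod(135, 1048576/8192) "Block within AU"
from dual;

Virtual extent Block within AU
-------------- ---------------
             1               7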

Let's get the names of disks 2 and 3.

SQL> select DISK_NUMBER, NAME
from V$ASM_DISK
where DISK_NUMBER in (2,3);

DISK_NUMBER NAME
----------- ------------------------------
         2 ASMDISK3
         3 ASMDISK4

SQL>

I am using ASMLIB, so at the OS level, those disks are /dev/oracleasm/disks/ASMDISK3 and /dev/oracleasm/disks/ASMDISK4.

Show me the money

Let's recap. My data is 7 blocks into the allocation unit 1175. That allocation unit is 1175 MB into the disk /dev/oracleasm/disks/ASMDISK4.

Let's first extract that allocation unit.

$ dd if=/dev/oracleasm/disks/ASMDISK4 bs=1024k count=1 skip=1175 of=AU1175.dd
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.057577 seconds, 18.2 MB/s
$ ls -l AU1175.dd
-rw-r--r-- 1 grid oinstall 1048576 Oct 27 22:45 AU1175.dd
$

Note the arguments to the dd command:
  • bs=1024k - allocation unit size
  • skip=1175 - allocation unit I am interested in
  • count=1 - I only need one allocation unit

Let's now extract block 7 out of that allocation unit.

$ dd if=AU1175.dd bs=8k count=1 skip=7 of=block135.dd
$

Note the arguments to the dd command now - bs=8k (data block size) and skip=7 (block I am interested in).

Let's now look at that block.

$ od -c block135.dd
...
0017760 \0 \0 , 001 002 002 301 002 003 C A T 001 006 332 217
0020000
$

At the bottom of that block I see my data (CAT). Remember that Oracle blocks are populated from the bottom up.

Note that I would see the same if I looked at the allocation unit 1179 on disk /dev/oracleasm/disks/ASMDISK3.

Conclusion

To locate an Oracle data block in ASM, I had to know in which datafile that block was stored. I then queried X$KFFXP in ASM to see the extent distribution for that datafile. I also had to know both the datafile block size and ASM allocation unit size, to work out in which allocation unit my block was.

None of this is ASM or RDBMS version specific (except the query from V$ASM_ATTRIBUTE, as there is no such view in 10g). The ASM disk group redundancy is also irrelevant. Of course, with normal and high redundancy we will have multiple copies of data, but the method to find the data location is exactly the same for all types of disk group redundancy.

Identification of under-performing disks in Exadata

Starting with Exadata software version 11.2.3.2, an under-performing disk can be detected and removed from an active configuration. This feature applies to both hard disks and flash disks.

About storage cell software processes

The Cell Server (CELLSRV) is the main component of Exadata software, which services I/O requests and provides advanced Exadata services, such as predicate processing offload. CELLSRV is implemented as a multithreaded process and is expected to use the largest portion of processor cycles on a storage cell.

The Management Server (MS) provides storage cell management and configuration tasks.

Disk state changes

Possibly under-performing - confined online

When poor disk performance is detected by the CELLSRV, the cell disk status changes to 'normal - confinedOnline' and the physical disk status changes to 'warning - confinedOnline'. This is expected behavior, and it indicates that the disk has entered the first phase of the identification of an under-performing disk. This is a transient phase, i.e. the disk does not stay in this status for a prolonged period of time.

That disk status change would be associated with the following entry in the storage cell alerthistory:

[MESSAGE ID] [date and time] info "Hard disk entered confinement status. The LUN n_m changed status to warning - confinedOnline. CellDisk changed status to normal - confinedOnline. Status: WARNING - CONFINEDONLINE  Manufacturer: [name]  Model Number: [model]  Size: [size]  Serial Number: [S/N]  Firmware: [F/W version]  Slot Number: m  Cell Disk: [cell disk name]  Grid Disk: [grid disk 1], [grid disk 2] ... Reason for confinement: threshold for service time exceeded"

At the same time, the following will be logged in the storage cell alert log:

CDHS: Mark cd health state change [cell disk name]  with newState HEALTH_BAD_ONLINE pending HEALTH_BAD_ONLINE ongoing INVALID cur HEALTH_GOOD
Celldisk entering CONFINE ACTIVE state with cause CD_PERF_SLOW_ABS activeForced: 0 inactiveForced: 0 trigger HistoryFail: 0, forceTestOutcome: 0 testFail: 0
global conf related state: numHDsConf: 1 numFDsConf: 0 numHDsHung: 0 numFDsHung: 0
[date and time]
CDHS: Do cd health state change [cell disk name] from HEALTH_GOOD to newState HEALTH_BAD_ONLINE
CDHS: Done cd health state change  from HEALTH_GOOD to newState HEALTH_BAD_ONLINE
ABSOLUTE SERVICE TIME VIOLATION DETECTED ON DISK [device name]: CD name - [cell disk name] AVERAGE SERVICETIME: 130.913043 ms. AVERAGE WAITTIME: 101.565217 ms. AVERAGE REQUESTSIZE: 625 sectors. NUMBER OF IOs COMPLETED IN LAST CYCLE ON DISK: 23 THRESHOLD VIOLATION COUNT: 6 NON_ZERO SERVICETIME COUNT: 6 SET CONFINE SUCCESS: 1
NOTE: Initiating ASM Instance operation: Query ASM Deactivation Outcome on 3 disks
Published 1 grid disk events Query ASM Deactivation Outcome on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 26502
Published 1 grid disk events Query ASM Deactivation Outcome on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 28966
Published 1 grid disk events Query ASM Deactivation Outcome on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 11912
...

Prepare for test - confined offline

The next action is to take all grid disks on the cell disk offline and run the performance tests on it. The CELLSRV asks ASM to take the grid disks offline and, if possible, the ASM takes the grid disks offline. In that case, the cell disk status changes to 'normal - confinedOffline' and the physical disk status changes to 'warning - confinedOffline'.

That action would be associated with the following entry in the cell alerthistory:

[MESSAGE ID] [date and time] warning "Hard disk entered confinement offline status. The LUN n_m changed status to warning - confinedOffline. CellDisk changed status to normal - confinedOffline. All subsequent I/Os on this disk are failed immediately. Confinement tests will be run on the disk to determine if the disk should be dropped. Status: WARNING - CONFINEDOFFLINE  Manufacturer: [name]  Model Number: [model]  Size: [size]  Serial Number: [S/N]  Firmware: [F/W version]  Slot Number: m  Cell Disk: [cell disk name]  Grid Disk: [grid disk 1], [grid disk 2] ... Reason for confinement: threshold for service time exceeded"
The following will be logged in the storage cell alert log:
NOTE: Initiating ASM Instance operation: ASM OFFLINE disk on 3 disks
Published 1 grid disk events ASM OFFLINE disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 28966
Published 1 grid disk events ASM OFFLINE disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 31801
Published 1 grid disk events ASM OFFLINE disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 26502
CDHS: Do cd health state change [cell disk name] from HEALTH_BAD_ONLINE to newState HEALTH_BAD_OFFLINE
CDHS: Done cd health state change  from HEALTH_BAD_ONLINE to newState HEALTH_BAD_OFFLINE

Note that ASM will take the grid disks offline only if it is safe to do so. That means ASM will not offline any disks if that would result in a disk group dismount. For example, if a partner disk is already offline, ASM will not offline this disk. In that case, the cell disk status will stay at 'normal - confinedOnline' until the disk can be safely taken offline.

In that case, the CELLSRV will repeatedly log 'Query ASM Deactivation Outcome' messages in the cell alert log. This is expected behavior and the messages will stop once ASM can take the grid disks offline.
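
To see which grid disks ASM has actually taken offline, a query along these lines can be run in the ASM instance (a sketch; it simply lists the disks that are not ONLINE):

SQL> select NAME, MODE_STATUS, MOUNT_STATUS
from V$ASM_DISK
where MODE_STATUS <> 'ONLINE';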

Under stress test

Once all grid disks are offline, the MS runs the performance tests on the cell disk. If it turns out that the disk is performing well, MS will notify CELLSRV that the disk is fine. The CELLSRV will then notify ASM to put the grid disks back online.

Poor performance - drop force

If the MS finds that the disk is indeed performing poorly, the cell disk status will change to 'proactive failure' and the physical disk status will change to 'warning - poor performance'. Such a disk needs to be removed from the active configuration. In that case the MS notifies the CELLSRV, which in turn notifies ASM to drop all grid disks from that cell disk.

That action would be associated with the following entry in the cell alerthistory:

[MESSAGE ID] [date and time] critical "Hard disk entered poor performance status. Status: WARNING - POOR PERFORMANCE Manufacturer: [name] Model Number: [model]  Size: [size]  Serial Number: [S/N] Firmware: [F/W version] Slot Number: m Cell Disk: [cell disk name]  Grid Disk: [grid disk 1], [grid disk 2] ... Reason for poor performance : threshold for service time exceeded"
The following will be logged in the storage cell alert log:
CDHS: Do cd health state change  after confinement [cell disk name] testFailed 1
CDHS: Do cd health state change [cell disk name] from HEALTH_BAD_OFFLINE to newState HEALTH_FAIL
NOTE: Initiating ASM Instance operation: ASM DROP dead disk on 3 disks
Published 1 grid disk events ASM DROP dead disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 28966
Published 1 grid disk events ASM DROP dead disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 11912
Published 1 grid disk events ASM DROP dead disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 26502
CDHS: Done cd health state change  from HEALTH_BAD_OFFLINE to newState HEALTH_FAIL

In the ASM alert log we will see the drop disk force operations for the respective grid disks, followed by the disk group rebalance operation.

Once the rebalance completes, the problem disk should be replaced by following the same process as for a disk with the predictive failure status.
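
The rebalance progress can be monitored from any ASM instance, for example with a query like this (a sketch based on the standard GV$ASM_OPERATION columns):

SQL> select INST_ID, OPERATION, STATE, POWER, SOFAR, EST_WORK, EST_MINUTES
from GV$ASM_OPERATION;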

All well - back to normal

If the MS tests determine that there are no performance issues with the disk, it will pass that information onto CELLSRV, which will in turn ask ASM to put the grid disks back online. The cell and physical disk status will change back to normal.

Disk confinement triggers

Any of the following conditions can trigger a disk confinement:
  1. Hung cell disk (the cause code in the storage cell alert log will be CD_PERF_HANG).
  2. Slow cell disk, e.g. high service time threshold (CD_PERF_SLOW_ABS), high relative service time threshold (CD_PERF_SLOW_RLTV), etc.
  3. High read or write latency, e.g. high latency on writes (CD_PERF_SLOW_LAT_WT), high latency on reads (CD_PERF_SLOW_LAT_RD), high latency on both reads and writes (CD_PERF_SLOW_LAT_RW), very high absolute latency on individual I/Os happening frequently (CD_PERF_SLOW_LAT_ERR), etc.
  4. Errors, e.g. I/O errors (CD_PERF_IOERR).
Conclusion

As a single under-performing disk can impact overall system performance, a new feature has been introduced in Exadata to identify such disks and remove them from the active configuration. This is a fully automated process that includes an automatic service request (ASR) for disk replacement.

I have recently published this on MOS as Doc ID 1509105.1.

Auto disk management feature in Exadata


The automatic disk management feature is about automating ASM disk operations in an Exadata environment. The automation functionality applies to both planned actions (for example, deactivating griddisks in preparation for storage cell patching) and unplanned events (for example, disk failure).

Exadata disks

In an Exadata environment we have the following disk types:
  • Physicaldisk is a hard disk on a storage cell. Each storage cell has 12 physical disks, all with the same capacity (600 GB, 2 TB or 3 TB).
  • Flashdisk is a Sun Flash Accelerator PCIe solid state disk on a storage cell. Each storage cell has 16 flashdisks - 24 GB each in X2 (Sun Fire X4270 M2) and 100 GB each in X3 (Sun Fire X4270 M3) servers.
  • Celldisk is a logical disk created on every physicaldisk and every flashdisk on a storage cell. Celldisks created on physicaldisks are named CD_00_cellname, CD_01_cellname ... CD_11_cellname. Celldisks created on flashdisks are named FD_00_cellname, FD_01_cellname ... FD_15_cellname.
  • Griddisk is a logical disk that can be created on a celldisk. In a standard Exadata deployment we create griddisks on hard disk based celldisks only. While it is possible to create griddisks on flashdisks, this is not a standard practice; instead we use flash based celldisks for the flash cache and flash log.
  • ASM disk in an Exadata environment is a griddisk.
Automated disk operations

These are the disk operations that are automated in Exadata:

1. Griddisk status change to OFFLINE/ONLINE

If a griddisk becomes temporarily unavailable, it will be automatically OFFLINED by ASM. When the griddisk becomes available, it will be automatically ONLINED by ASM.

2. Griddisk DROP/ADD

If a physicaldisk fails, all griddisks on that physicaldisk will be DROPPED with FORCE option by ASM. If a physicaldisk status changes to predictive failure, all griddisks on that physical disk will be DROPPED by ASM. If a flashdisk performance degrades, the corresponding griddisks (if any) will be DROPPED with FORCE option by ASM.

When a physicaldisk is replaced, the celldisk and griddisks will be recreated by CELLSRV, and the griddisks will be automatically ADDED by ASM.

NOTE: If a griddisk that is in NORMAL state and ONLINE mode status is manually dropped with the FORCE option (for example, by a DBA with 'alter diskgroup ... drop disk ... force'), it will be automatically added back by ASM. In other words, dropping a healthy disk with the force option will not achieve the desired effect.

3. Griddisk OFFLINE/ONLINE for rolling Exadata software (storage cells) upgrade

Before the rolling upgrade all griddisks will be inactivated on the storage cell by CELLSRV and OFFLINED by ASM. After the upgrade all griddisks will be activated on the storage cell and ONLINED in ASM.

4. Manual griddisk activation/inactivation

If a griddisk is manually inactivated on a storage cell, by running 'cellcli -e alter griddisk ... inactive', it will be automatically OFFLINED by ASM. When the griddisk is activated on the storage cell, it will be automatically ONLINED by ASM (see the example after this list).

5. Griddisk confined ONLINE/OFFLINE

If a griddisk is taken offline by CELLSRV, because the underlying disk is suspected for poor performance, all griddisks on that celldisk will be automatically OFFLINED by ASM. If the tests confirm that the celldisk is performing poorly, ASM will drop all griddisks on that celldisk. If the tests find that the disk is actually fine, ASM will online all griddisks on that celldisk.
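
For item 4 above, the manual inactivation and activation would typically look like this (a sketch; 'all' can be replaced with a specific griddisk name, and the asmdeactivationoutcome attribute is checked first to confirm it is safe to inactivate the griddisks):

$ cellcli -e list griddisk attributes name, asmmodestatus, asmdeactivationoutcome
$ cellcli -e alter griddisk all inactive
$ cellcli -e alter griddisk all active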

Software components

1. Cell Server (CELLSRV)

The Cell Server (CELLSRV) runs on the storage cell and is the main component of Exadata software. In the context of automatic disk management, its tasks are to process the Management Server notifications and handle ASM queries about the state of griddisks.

2. Management Server (MS)

The Management Server (MS) runs on the storage cell and implements a web service for cell management commands, and runs background monitoring threads. The MS monitors the storage cell for hardware changes (e.g. disk plugged in) or alerts (e.g. disk failure), and notifies the CELLSRV about those events.

3. Automatic Storage Management (ASM)

The Automatic Storage Management (ASM) instance runs on the compute (database) node and has two processes that are relevant to the automatic disk management feature:
  • Exadata Automation Manager (XDMG) initiates automation tasks involved in managing Exadata storage. It monitors all configured storage cells for state changes, such as a failed disk getting replaced, and performs the required tasks for such events. Its primary tasks are to watch for inaccessible disks and cells and when they become accessible again, to initiate the ASM ONLINE operation.
  • Exadata Automation Manager (XDWK) performs automation tasks requested by XDMG. It gets started when asynchronous actions such as disk ONLINE, DROP and ADD are requested by XDMG. After a 5 minute period of inactivity, this process will shut itself down.
Working together

All three software components work together to achieve automatic disk management.

In the case of disk failure, the MS detects that the disk has failed. It then notifies the CELLSRV about it. If there are griddisks on the failed disk, the CELLSRV notifies ASM about the event. ASM then drops all griddisks from the corresponding disk groups.

In the case of a replacement disk inserted into the storage cell, the MS detects the new disk and checks the cell configuration file to see if celldisk and griddisks need to be created on it. If yes, it notifies the CELLSRV to do so. Once finished, the CELLSRV notifies ASM about new griddisks and ASM then adds them to the corresponding disk groups.

In the case of a poorly performing disk, the CELLSRV first notifies ASM to offline the disk. If possible, ASM then offlines the disk. One example where ASM would refuse to offline the disk is when a partner disk is already offline; offlining the disk would then result in a disk group dismount, so ASM will not do that. Once the disk is offlined by ASM, it notifies the CELLSRV that the performance tests can be carried out. Once done with the tests, the CELLSRV will either tell ASM to drop that disk (if it failed the tests) or online it (if it passed them).

The actions by MS, CELLSRV and ASM are coordinated in a similar fashion, for other disk events.

ASM initialization parameters

The following are the ASM initialization parameters relevant to the auto disk management feature:
  • _AUTO_MANAGE_EXADATA_DISKS controls the auto disk management feature. To disable the feature set this parameter to FALSE. Range of values: TRUE [default] or FALSE.
  • _AUTO_MANAGE_NUM_TRIES controls the maximum number of attempts to perform an automatic operation. Range of values: 1-10. Default value is 2.
  • _AUTO_MANAGE_MAX_ONLINE_TRIES controls the maximum number of attempts to ONLINE a disk. Range of values: 1-10. Default value is 3.
All three parameters are static, which means they require an ASM instance restart to take effect. Note that all these are hidden (underscore) parameters that should not be modified unless advised by Oracle Support.
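
To check the current values of these hidden parameters in the ASM instance, a query against the underlying X$ views can be used (a sketch; run as SYSASM):

SQL> select x.KSPPINM "Parameter", y.KSPPSTVL "Value"
from X$KSPPI x, X$KSPPCV y
where x.INDX = y.INDX
and x.KSPPINM like '\_auto\_manage%' escape '\';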

Files

The following are the files relevant to the automatic disk management feature:

1. Cell configuration file - $OSSCONF/cell_disk_config.xml. An XML file on the storage cell that contains information about all configured objects (storage cell, disks, IORM plans, etc) except alerts and metrics. The CELLSRV reads this file during startup and writes to it when an object is updated (e.g. updates to IORM plan).

2. Grid disk file - $OSSCONF/griddisk.owners.dat. A binary file on the storage cell that contains the following information for all griddisks:
  • ASM disk name
  • ASM disk group name
  • ASM failgroup name
  • Cluster identifier (which cluster this disk belongs to)
  • Requires DROP/ADD (should the disk be dropped from or added to ASM)
3. MS log and trace files - ms-odl.log and ms-odl.trc in $ADR_BASE/diag/asm/cell/`hostname -s`/trace directory on the storage cell.

4. CELLSRV alert log - alert.log in $ADR_BASE/diag/asm/cell/`hostname -s`/trace directory on the storage cell.

5. ASM alert log - alert_+ASMn.log in $ORACLE_BASE/diag/asm/+asm/+ASMn/trace directory on the compute node.

6. XDMG and XDWK trace files - +ASMn_xdmg_nnnnn.trc and +ASMn_xdwk_nnnnn.trc in $ORACLE_BASE/diag/asm/+asm/+ASMn/trace directory on the compute node.

Conclusion

In an Exadata environment, the ASM has been enhanced to provide the automatic disk management functionality. Three software components that work together to provide this facility are the Exadata Cell Server (CELLSRV), Exadata Management Server (MS) and Automatic Storage Management (ASM).

I have also published this via MOS as Doc ID 1484274.1.

How many allocation units per file

This post is about the amount of space allocated to ASM based files.

The smallest amount of space ASM allocates is an allocation unit (AU). The default AU size is 1 MB, except in Exadata where the default AU size is 4 MB.

The space for ASM based files is allocated in extents, which consist of one or more AUs. In version 11.2, the first 20000 extents consist of 1 AU each, the next 20000 extents have 4 AUs each, and extents beyond that have 16 AUs each. This is known as the variable size extent feature. In version 11.1, the extent growth was 1-8-64 AUs. In version 10, there are no variable size extents, so all extents are exactly 1 AU.
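
As a quick way to see the extent sizes for a given file, the X$KFFXP view can be queried in the ASM instance. This is a sketch that assumes the SIZE_KFFXP column (extent size in allocation units) and uses disk group number 1 and file number 271 purely as example values:

SQL> select SIZE_KFFXP "AUs per extent", count(*) "Extents"
from X$KFFXP
where GROUP_KFFXP=1 and NUMBER_KFFXP=271
group by SIZE_KFFXP
order by 1;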

Bytes vs space

The definition for V$ASM_FILE view says the following for BYTES and SPACE columns:
  • BYTES - Number of bytes in the file
  • SPACE - Number of bytes allocated to the file
There is a subtle difference in the definitions and a very large difference in the numbers. Let's have a closer look. For the examples in this post I will use database and ASM version 11.2.0.3, with ASMLIB based disks.

First get some basic info about disk group DATA where most of my datafiles are. Run the following SQL connected to the database instance.

SQL> select NAME, GROUP_NUMBER, ALLOCATION_UNIT_SIZE/1024/1024 "AU size (MB)", TYPE
from V$ASM_DISKGROUP
where NAME='DATA';

NAME             GROUP_NUMBER AU size (MB) TYPE
---------------- ------------ ------------ ------
DATA                        1            1 NORMAL

Now create one small file (under 60 extents) and one large file (over 60 extents).

SQL> create tablespace T1 datafile '+DATA' size 10 M;

Tablespace created.

SQL> create tablespace T2 datafile '+DATA' size 100 M;

Tablespace created.

Get the ASM file numbers for those two files:

SQL> select NAME, round(BYTES/1024/1024) "MB" from V$DATAFILE;

NAME                                               MB
------------------------------------------ ----------
...
+DATA/br/datafile/t1.272.818281717                 10
+DATA/br/datafile/t2.271.818281741                100

The small file is ASM file number 272 and the large file is ASM file number 271.

Get the bytes and space information (in AUs) for these two files.

SQL> select FILE_NUMBER, round(BYTES/1024/1024) "Bytes (AU)", round(SPACE/1024/1024) "Space (AUs)", REDUNDANCY
from V$ASM_FILE
where FILE_NUMBER in (271, 272) and GROUP_NUMBER=1;

FILE_NUMBER Bytes (AU) Space (AUs) REDUND
----------- ---------- ----------- ------
        272         10          22 MIRROR
        271        100         205 MIRROR

The BYTES column shows the actual file size. For the small file, BYTES shows a file size of 10 MB, i.e. 10 AUs (the AU size is 1 MB). The space required for the small file is 22 AUs: 10 AUs for the actual datafile plus 1 AU for the file header, and because the file is mirrored, double that, so 22 AUs in total.

For the large file, BYTES shows a file size of 100 MB, i.e. 100 AUs. So far so good. But the space required for the large file is 205 AUs, not 202 as one might expect. What are those extra 3 AUs for? Let's find out.

ASM space

The following query (run in ASM instance) will show us the extent distribution for ASM file 271.

SQL> select XNUM_KFFXP "Virtual extent", PXN_KFFXP "Physical extent", DISK_KFFXP "Disk number", AU_KFFXP "AU number"
from X$KFFXP
where GROUP_KFFXP=1 and NUMBER_KFFXP=271
order by 1,2;

Virtual extent Physical extent Disk number  AU number
-------------- --------------- ----------- ----------
             0               0           3       1155
             0               1           0       1124
             1               2           0       1125
             1               3           2       1131
             2               4           2       1132
             2               5           0       1126
...
           100             200           3       1418
           100             201           1       1412
    2147483648               0           3       1122
    2147483648               1           0       1137
    2147483648               2           2       1137

205 rows selected.

As the file is mirrored, we see that each virtual extent has two physical extents. But the interesting part of the result is the last three allocation units, for virtual extent number 2147483648, which is triple mirrored. We will have a closer look at those with kfed, and for that we will need the disk names.

Get the disk names.

SQL> select DISK_NUMBER, PATH
from V$ASM_DISK
where GROUP_NUMBER=1;

DISK_NUMBER PATH
----------- ---------------
          0 ORCL:ASMDISK1
          1 ORCL:ASMDISK2
          2 ORCL:ASMDISK3
          3 ORCL:ASMDISK4

Let's now check what type of data is in those allocation units.

$ kfed read /dev/oracleasm/disks/ASMDISK4 aun=1122 | grep type
kfbh.type:                           12 ; 0x002: KFBTYP_INDIRECT

$ kfed read /dev/oracleasm/disks/ASMDISK1 aun=1137 | grep type
kfbh.type:                           12 ; 0x002: KFBTYP_INDIRECT

$ kfed read /dev/oracleasm/disks/ASMDISK3 aun=1137 | grep type
kfbh.type:                           12 ; 0x002: KFBTYP_INDIRECT

These additional allocation units hold ASM metadata for the large file. More specifically, they hold the extent map information that could not fit into the ASM file directory block. The file directory needs extra space to keep track of files larger than 60 extents, so it needs an additional allocation unit to do so. While the file directory needs only a few extra ASM metadata blocks, the smallest unit of space the ASM can allocate is an AU. And because this is metadata, this AU is triple mirrored (even in a normal redundancy disk group), hence 3 extra allocation units for the large file. In an external redundancy disk group, there would be only one extra AU per large file.

Conclusion

The amount of space ASM needs for a file depends on two factors - the file size and the disk group redundancy.

In an external redundancy disk group, the required space will be the file size + 1 AU for the file header + 1 AU for indirect extents if the file is larger than 60 AUs.

In a normal redundancy disk group, the required space will be twice the file size + 2 AUs for the file header + 3 AUs for indirect extents if the file is larger than 60 AUs.

In a high redundancy disk group, the required space will be three times the file size + 3 AUs for the file header + 3 AUs for indirect extents if the file is larger than 60 AUs.
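
Applying the normal redundancy formula to the 100 MB datafile from the example above (1 MB AU): 2 x 100 AUs for the data + 2 AUs for the file header + 3 AUs for the indirect extents = 205 AUs, which matches the SPACE value reported by V$ASM_FILE.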

ASM version 12c is out


Oracle Database version 12c has been released, which means a brand new version of ASM is out! Notable new features are Flex ASM, proactive data validation and better handling of disk management operations. Let's have an overview here, with more details to follow in separate posts.

Flex ASM

No need to run ASM instances on all nodes in the cluster. In a default installation there would be three ASM instances, irrespective of the number of nodes in the cluster. An ASM instance can serve both local and remote databases. If an ASM instance fails, the database instances do not crash; instead they fail over to another ASM instance in the cluster.

Flex ASM introduces a new instance type - an I/O server or ASM proxy instance. There will be a few (default is 3) I/O server instances in an Oracle Flex Cluster environment, serving indirect clients (typically an ACFS cluster file system). An I/O server instance can run on the same node as an ASM instance or on a different node in a flex cluster. In all cases, an I/O server instance needs to talk to a flex ASM instance to get metadata information on behalf of an indirect client.

The flex ASM is an optional feature in 12c.

Physical metadata replication

In addition to replicating the disk header (available since 11.1.0.7), ASM 12c also replicates the allocation table, within each disk. This makes ASM more resilient to bad disk sectors and external corruptions. The disk group attribute PHYS_META_REPLICATED is provided to track the replication status of a disk group.

$ asmcmd lsattr -G DATA -l phys_meta_replicated
Name                  Value
phys_meta_replicated  true

The physical metadata replication status flag is in the disk header (kfdhdb.flags). This flag only ever goes from 0 to 1 (once the physical metadata has been replicated) and it never goes back to 0.

More storage

ASM 12c supports 511 disk groups, with the maximum disk size of 32 PB.

Online with power

ASM 12c has a fast mirror resync power limit to control resync parallelism and improve performance. Disk resync checkpoint functionality provides faster recovery from instance failures by enabling the resync to resume from the point at which the process was interrupted or stopped, instead of starting from the beginning. ASM 12c also provides a time estimate for the completion of a resync operation.

Use power limit for disk resync operations, similar to disk rebalance, with the range from 1 to 1024:

$ asmcmd online -G DATA -D DATA_DISK1 --power 42

Disk scrubbing - proactive data validation and repair

In ASM 12c the disk scrubbing checks for data corruptions and repairs them automatically in normal and high redundancy disk groups. This is done during disk group rebalance if a disk group attribute CONTENT.CHECK is set to TRUE. The check can also be performed manually by running ALTER DISKGROUP SCRUB command.

The scrubbing can be performed at the disk group, disk or a file level and can be monitored via V$ASM_OPERATION view.
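
For example, a manual scrub of a whole disk group could be started with something like the following (a sketch; the disk group name is an assumption and the syntax follows the 12c SQL reference):

SQL> alter diskgroup DATA scrub power low;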

Even read for disk groups

In previous ASM versions, the data was always read from the primary copy (in normal and high redundancy disk groups) unless a preferred failgroup was set up. The data from the mirror would be read only if the primary copy was unavailable. With the even read feature, each read request can be sent to the least loaded of the possible source disks. The least loaded in this context is simply the disk with the least number of read requests.

Even read functionality is enabled by default on all Oracle Database and Oracle ASM instances of version 12.1 and higher in non-Exadata environments. The functionality is enabled in an Exadata environment when there is a failure. Even read functionality is applicable only to disk groups with normal or high redundancy.

Replace an offline disk

We now have a new ALTER DISKGROUP REPLACE DISK command, which is a mix of the rebalance and fast mirror resync functionality. Instead of a full rebalance, the new (replacement) disk is populated with data read from the surviving partner disks only. This effectively reduces the time needed to replace a failed disk.

Note that the disk being replaced must be in OFFLINE state. If the disk offline timer has expired, the disk is dropped, which initiates the rebalance. On a disk add, there will be another rebalance.
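
A replace operation would look something like this (a sketch; the disk name and the path of the replacement device are assumptions):

SQL> alter diskgroup DATA replace disk DATA_0001 with '/dev/sdh1' power 4;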

ASM password file in a disk group

ASM version 11.2 allowed the ASM spfile to be placed in a disk group. In 12c we can also put the ASM password file in an ASM disk group. Unlike the ASM spfile, access to the ASM password file is possible only after ASM startup and once the disk group containing the password file is mounted.

The orapw utility now accepts ASM disk group as a password destination. The asmcmd has also been enhanced to allow ASM password management.
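
For example, creating the ASM password file directly in a disk group might look like this (a sketch; the disk group name is an assumption), and the asmcmd pwget command should then show the current ASM password file location:

$ orapwd file='+DATA/orapwasm' asm=y
$ asmcmd pwget --asm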

Failgroup repair timer

We now have a failgroup repair timer with the default value of 24 hours. Note that the disk repair timer still defaults to 3.6 hours.

Rebalance rebalanced

The rebalance work is now estimated based on a detailed work plan, which can be generated and viewed separately. We now have a new EXPLAIN WORK command and a new V$ASM_ESTIMATE view.
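
For example, the work plan for dropping a disk could be generated and inspected like this (a sketch; the disk group and disk names are assumptions):

SQL> explain work for alter diskgroup DATA drop disk DATA_0001;

SQL> select EST_WORK from V$ASM_ESTIMATE;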

In ASM 12c we (finally) have a priority ordered rebalance - the critical files (typically control files and redo logs) are rebalanced before other database files.

In Exadata, the rebalance can be offloaded to storage cells.

Thin provisioning support

ASM 12c enables thin provisioning support for some operations (that are typically associated with the disk group rebalance). The feature is disabled by default, and can be enabled at the disk group creation time or later by setting disk group attribute THIN_PROVISIONED to TRUE.
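
Enabling it on an existing disk group would look something like this (a sketch; the disk group name is an assumption):

SQL> alter diskgroup DATA set attribute 'thin_provisioned'='TRUE';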

Enhanced file access control (ACL)

Easier file ownership and permission changes, e.g. a file permission can be changed on an open file. ACL has also been implemented for Microsoft Windows OS.

Oracle Cluster Registry (OCR) backup in ASM disk group

Storing the OCR backup in an Oracle ASM disk group simplifies OCR management by permitting access to the OCR backup from any node in the cluster should an OCR recovery become necessary.

Use ocrconfig command to specify an OCR backup location in an Oracle ASM disk group:

# ocrconfig -backuploc +DATA

Physical metadata replication


Starting with version 12.1, ASM replicates the physically addressed metadata. This means that ASM maintains two copies of the disk header, the Free Space Table and the Allocation Table data. Note that this metadata is not mirrored, but replicated. ASM mirroring refers to copies of the same data on different disks. The copies of the physical metadata are on the same disk, hence the term replicated. This also means that the physical metadata is replicated even in an external redundancy disk group.

The Partnership and Status Table (PST) is also referred to as physically addressed metadata, but the PST is not replicated. This is because the PST is protected by mirroring - in normal and high redundancy disk groups.

Where is the replicated metadata

The physically addressed metadata is in allocation unit 0 (AU0) on every ASM disk. With this feature enabled, ASM will copy the contents of AU0 into allocation unit 11 (AU11), and from that point on, it will maintain both copies. This feature will be automatically enabled when a disk group is created with ASM compatibility of 12.1 or higher, or when ASM compatibility is advanced to 12.1 or higher, for an existing disk group.

If there is data in AU11, when the ASM compatibility is advanced to 12.1 or higher, ASM will simply move that data somewhere else, and use AU11 for the physical metadata replication.

Since version 11.1.0.7, ASM keeps a copy of the disk header in the second last block of AU1. Interestingly, in version 12.1, ASM still keeps the copy of the disk header in AU1, which means that now every ASM disk will have three copies of the disk header block.

Disk group attribute PHYS_META_REPLICATED

The status of the physical metadata replication can be checked by querying the disk group attribute PHYS_META_REPLICATED. Here is an example with the asmcmd command that shows how to check the replication status for disk group DATA:

$ asmcmd lsattr -G DATA -l phys_meta_replicated
Name                  Value
phys_meta_replicated  true

The phys_meta_replicated=true means that the physical metadata for disk group DATA has been replicated.

The kfdhdb.flags field in the ASM disk header indicates the status of the physical metadata replication as follows:
  • kfdhdb.flags = 0 - the physical metadata has not been replicated
  • kfdhdb.flags = 1 - the physical metadata has been replicated
  • kfdhdb.flags = 2 - the physical metadata replication is in progress
Once the flag is set to 1, it will never go back to 0.

Metadata replication in action

As stated earlier, the physical metadata will be replicated in disk groups with ASM compatibility of 12.1 or higher. Let's first have a look at a disk group with ASM compatible set to 12.1:

$ asmcmd lsattr -G DATA -l compatible.asm
Name            Value
compatible.asm  12.1.0.0.0
$ asmcmd lsattr -G DATA -l phys_meta_replicated
Name                  Value
phys_meta_replicated  true

This shows that the physical metadata has been replicated. Now verify that all disks in the disk group have the kfdhdb.flags set to 1:

$ for disk in `asmcmd lsdsk -G DATA --suppressheader`; do kfed read $disk | egrep "dskname|flags"; done
kfdhdb.dskname:               DATA_0000 ; 0x028: length=9
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001
kfdhdb.dskname:               DATA_0001 ; 0x028: length=9
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001
kfdhdb.dskname:               DATA_0002 ; 0x028: length=9
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001
kfdhdb.dskname:               DATA_0003 ; 0x028: length=9
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001

This shows that all disks have the replication flag set to 1, i.e. that the physical metadata has been replicated for all disks in the disk group.

Let's now have a look at a disk group with ASM compatibility 11.2, that is later advanced to 12.1:

SQL> create diskgroup DG1 external redundancy
  2  disk '/dev/sdi1'
  3  attribute 'COMPATIBLE.ASM'='11.2';

Diskgroup created.

Check the replication status:

$ asmcmd lsattr -G DG1 -l phys_meta_replicated
Name  Value

Nothing - no such attribute. That is because the ASM compatibility is less than 12.1. We also expect that the kfdhdb.flags is 0 for the only disk in that disk group:

$ kfed read /dev/sdi1 | egrep "type|dskname|grpname|flags"
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfdhdb.dskname:                DG1_0000 ; 0x028: length=8
kfdhdb.grpname:                     DG1 ; 0x048: length=3
kfdhdb.flags:                         0 ; 0x0fc: 0x00000000

Let's now advance the ASM compatibility to 12.1:

$ asmcmd setattr -G DG1 compatible.asm 12.1.0.0.0

Check the replication status:

$ asmcmd lsattr -G DG1 -l phys_meta_replicated
Name                  Value
phys_meta_replicated  true

The physical metadata has been replicated, so we should now see the kfdhdb.flags set to 1:

$ kfed read /dev/sdi1 | egrep "dskname|flags"
kfdhdb.dskname:                DG1_0000 ; 0x028: length=8
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001

The physical metadata should be replicated in AU11:

$ kfed read /dev/sdi1 aun=11 | egrep "type|dskname|flags"
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfdhdb.dskname:                DG1_0000 ; 0x028: length=8
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001

$ kfed read /dev/sdi1 aun=11 blkn=1 | grep type
kfbh.type:                            2 ; 0x002: KFBTYP_FREESPC
$ kfed read /dev/sdi1 aun=11 blkn=2 | grep type
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL

This shows that AU11 has a copy of the data from AU0.

Finally check for the disk header copy in AU1:

$ kfed read /dev/sdi1 aun=1 blkn=254 | grep type
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD

This shows that there is also a copy of the disk header in the second last block of AU1.

Conclusion

ASM version 12 replicates the physically addressed metadata, i.e. it keeps the copy of AU0 in AU11 - on the same disk. This allows ASM to automatically recover from damage to any data in AU0. Note that ASM will not be able to recover from loss of any other data in an external redundancy disk group. In a normal redundancy disk group, ASM will be able to recover from a loss of any data in one or more disks in a single failgroup. In a high redundancy disk group, ASM will be able to recover from a loss of any data in one or more disks in any two failgroups.


Free Space Table


The ASM Free Space Table (FST) provides a summary of which allocation table blocks have free space. It contains an array of bit patterns indexed by allocation table block number. The table is used to speed up the allocation of new allocation units by avoiding reading blocks that are full.

The FST is technically part of the Allocation Table (AT), and is at block 1 of the AT. The Free Space Table and the Allocation Table are so-called physically addressed metadata, as they are always at a fixed location on each ASM disk.

Locating the Free Space Table

The location of the FST block is stored in the ASM disk header (field kfdhdb.fstlocn). In the following example, the lookup of that field in the disk header shows that the FST is in block 1.

$ kfed read /dev/sdc1 | grep kfdhdb.fstlocn
kfdhdb.fstlocn:                       1 ; 0x0cc: 0x00000001

Let’s have a closer look at the FST:

$ kfed read /dev/sdc1 blkn=1 | more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            2 ; 0x002: KFBTYP_FREESPC
...
kfdfsb.aunum:                         0 ; 0x000: 0x00000000
kfdfsb.max:                         254 ; 0x004: 0x00fe
kfdfsb.cnt:                         254 ; 0x006: 0x00fe
kfdfsb.bound:                         0 ; 0x008: 0x0000
kfdfsb.flag:                          1 ; 0x00a: B=1
kfdfsb.ub1spare:                      0 ; 0x00b: 0x00
kfdfsb.spare[0]:                      0 ; 0x00c: 0x00000000
kfdfsb.spare[1]:                      0 ; 0x010: 0x00000000
kfdfsb.spare[2]:                      0 ; 0x014: 0x00000000
kfdfse[0].fse:                      119 ; 0x018: FREE=0x7 FRAG=0x7
kfdfse[1].fse:                       16 ; 0x019: FREE=0x0 FRAG=0x1
kfdfse[2].fse:                       16 ; 0x01a: FREE=0x0 FRAG=0x1
kfdfse[3].fse:                       16 ; 0x01b: FREE=0x0 FRAG=0x1
...
kfdfse[4037].fse:                     0 ; 0xfdd: FREE=0x0 FRAG=0x0
kfdfse[4038].fse:                     0 ; 0xfde: FREE=0x0 FRAG=0x0
kfdfse[4039].fse:                     0 ; 0xfdf: FREE=0x0 FRAG=0x0

For this FST block, the first allocation table block is in AU 0:

kfdfsb.aunum:                         0 ; 0x000: 0x00000000

The maximum number of FST entries this block can hold is 254:

kfdfsb.max:                         254 ; 0x004: 0x00fe

How many Free Space Tables

Large ASM disks may have more than one stride. The field kfdhdb.mfact in the ASM disk header shows the stride size - expressed in allocation units. Each stride will have its own physically addressed metadata, which means that it will have its own Free Space Table.

The second stride will have its physically addressed metadata in the first AU of the stride. Let's have a look.

$ kfed read /dev/sdc1 | grep mfact
kfdhdb.mfact:                    113792 ; 0x0c0: 0x0001bc80

This shows the stride size is 113792 AUs. Let's check the FST for the second stride. That should be in block 1 in AU113792.

$ kfed read /dev/sdc1 aun=113792 blkn=1 | grep type
kfbh.type:                            2 ; 0x002: KFBTYP_FREESPC

As expected, we have another FST in AU113792. If we had another stride, there would be another FST at the beginning of that stride. As it happens, I have a large disk with a few strides, so we see the FST at the beginning of the third stride as well:

$ kfed read /dev/sdc1 aun=227584 blkn=1 | grep type
kfbh.type:                            2 ; 0x002: KFBTYP_FREESPC

Conclusion

The Free Space Table is in block 1 of allocation unit 0 of every ASM disk. If the disk has more than one stride, each stride will have its own Free Space Table.


Allocation Table


Every ASM disk contains at least one Allocation Table (AT) that describes the contents of the disk. The AT has one entry for every allocation unit (AU) on the disk. If an AU is allocated, the Allocation Table will have the extent number and the file number the AU belongs to.

Finding the Allocation Table

The location of the first block of the Allocation Table is stored in the ASM disk header (field kfdhdb.altlocn). In the following example, the lookup of that field shows that the AT starts at block 2.

$ kfed read /dev/sdc1 | grep kfdhdb.altlocn
kfdhdb.altlocn:                       2 ; 0x0d0: 0x00000002

Let’s have a closer look at the first block of the Allocation Table.

$ kfed read /dev/sdc1 blkn=2 | more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
...
kfdatb.aunum:                         0 ; 0x000: 0x00000000
kfdatb.shrink:                      448 ; 0x004: 0x01c0
...

The kfdatb.aunum=0 means that AU0 is the first AU described by this AT block. The kfdatb.shrink=448 means that this AT block can hold the information for 448 AUs. In the next AT block we should see kfdatb.aunum=448, meaning that it describes the next 448 AUs, starting with AU448. Let's have a look:

$ kfed read /dev/sdc1 blkn=3 | grep kfdatb.aunum
kfdatb.aunum:                       448 ; 0x000: 0x000001c0

The next AT block should show kfdatb.aunum=896:

$ kfed read /dev/sdc1 blkn=4 | grep kfdatb.aunum
kfdatb.aunum:                       896 ; 0x000: 0x00000380

And so on...

Allocation table entries

For allocated AUs, the Allocation Table entry (kfdate[i]) holds the extent number, file number and the state of the allocation unit - normally allocated (flag V=1), vs a free or unallocated AU (flag V=0).

Let’s have a look at Allocation Table block 3.

$ kfed read /dev/sdc1 blkn=3 | more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
...
kfdatb.aunum:                       448 ; 0x000: 0x000001c0
...
kfdate[142].discriminator:            1 ; 0x498: 0x00000001
kfdate[142].allo.lo:                  0 ; 0x498: XNUM=0x0
kfdate[142].allo.hi:            8388867 ; 0x49c: V=1 I=0 H=0 FNUM=0x103
kfdate[143].discriminator:            1 ; 0x4a0: 0x00000001
kfdate[143].allo.lo:                  1 ; 0x4a0: XNUM=0x1
kfdate[143].allo.hi:            8388867 ; 0x4a4: V=1 I=0 H=0 FNUM=0x103
kfdate[144].discriminator:            1 ; 0x4a8: 0x00000001
kfdate[144].allo.lo:                  2 ; 0x4a8: XNUM=0x2
kfdate[144].allo.hi:            8388867 ; 0x4ac: V=1 I=0 H=0 FNUM=0x103
kfdate[145].discriminator:            1 ; 0x4b0: 0x00000001
kfdate[145].allo.lo:                  3 ; 0x4b0: XNUM=0x3
kfdate[145].allo.hi:            8388867 ; 0x4b4: V=1 I=0 H=0 FNUM=0x103
kfdate[146].discriminator:            1 ; 0x4b8: 0x00000001
kfdate[146].allo.lo:                  4 ; 0x4b8: XNUM=0x4
kfdate[146].allo.hi:            8388867 ; 0x4bc: V=1 I=0 H=0 FNUM=0x103
kfdate[147].discriminator:            1 ; 0x4c0: 0x00000001
kfdate[147].allo.lo:                  5 ; 0x4c0: XNUM=0x5
kfdate[147].allo.hi:            8388867 ; 0x4c4: V=1 I=0 H=0 FNUM=0x103
kfdate[148].discriminator:            0 ; 0x4c8: 0x00000000
kfdate[148].free.lo.next:            16 ; 0x4c8: 0x0010
kfdate[148].free.lo.prev:            16 ; 0x4ca: 0x0010
kfdate[148].free.hi:                  2 ; 0x4cc: V=0 ASZM=0x2
kfdate[149].discriminator:            0 ; 0x4d0: 0x00000000
kfdate[149].free.lo.next:             0 ; 0x4d0: 0x0000
kfdate[149].free.lo.prev:             0 ; 0x4d2: 0x0000
kfdate[149].free.hi:                  0 ; 0x4d4: V=0 ASZM=0x0
...

The excerpt shows the Allocation Table entries for file 259 (hexadecimal FNUM=0x103), which start at kfdate[142] and end at kfdate[147]. That shows that ASM file 259 has a total of 6 AUs. The AU numbers will be the index of kfdate[i] + the offset (kfdatb.aunum=448). In other words, 142+448=590, 143+448=591 ... 147+448=595. Let's verify that by querying X$KFFXP:

SQL> select AU_KFFXP
from X$KFFXP
where GROUP_KFFXP=1  -- disk group 1
and NUMBER_KFFXP=259 -- file 259
;

  AU_KFFXP
----------
       590
       591
       592
       593
       594
       595

6 rows selected.

Free space

In the above kfed output, we see that kfdate[148] and kfdate[149] have the word free next to them, which marks them as free or unallocated allocation units (flagged with V=0). That kfed output is truncated, but there are many more free allocation units described by this AT block.

The stride

Each AT block can describe 448 AUs (the kfdatb.shrink value from the Allocation Table), and the whole AT can have 254 blocks (the kfdfsb.max value from the Free Space Table). This means that one Allocation Table can describe 254x448=113792 allocation units. This is called the stride, and the stride size - expressed in number of allocation units - is in the field kfdhdb.mfact, in ASM disk header:

$ kfed read /dev/sdc1 | grep kfdhdb.mfact
kfdhdb.mfact:                    113792 ; 0x0c0: 0x0001bc80

The stride size in this example is for an AU size of 1 MB, which can fit 256 metadata blocks in AU0. Block 0 is for the disk header and block 1 is for the Free Space Table, which leaves 254 blocks for the Allocation Table blocks.

With the AU size of 4MB (default in Exadata), the stride size will be 454272 allocation units or 1817088 MB. With the larger AU size, the stride will also be larger.

How many Allocation Tables

Large ASM disks may have more than one stride. Each stride will have its own physically addressed metadata, which means that it will have its own Allocation Table.

The second stride will have its physically addressed metadata in the first AU of the stride. Let's have a look.

$ kfed read /dev/sdc1 | grep mfact
kfdhdb.mfact:                    113792 ; 0x0c0: 0x0001bc80

This shows the stride size is 113792 AUs. Let's check the AT entries for the second stride. Those should be in blocks 2-255 in AU113792.

$ kfed read /dev/sdc1 aun=113792 blkn=2 | grep type
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
...
$ kfed read /dev/sdc1 aun=113792 blkn=255 | grep type
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL

As expected, we have another AT in AU113792. If we had another stride, there would be another AT at the beginning of that stride. As it happens, I have a large disk with a few strides, so we see the AT at the beginning of the third stride as well:

$ kfed read /dev/sdc1 aun=227584 blkn=2 | grep type
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL

Conclusion

Every ASM disk contains at least one Allocation Table that describes the contents of the disk. The AT has one entry for every allocation unit on the disk. If the disk has more than one stride, each stride will have its own Allocation Table.

Partnership and Status Table


The Partnership and Status Table (PST) contains the information about all ASM disks in a disk group – disk number, disk status, partner disk number, heartbeat info and the failgroup info (11g and later).

Allocation unit number 1 on every ASM disk is reserved for the PST, but only some disks will have the PST data.

PST count

In an external redundancy disk group there will be only one copy of the PST.

In a normal redundancy disk group there will be at least two copies of the PST. If there are three or more failgroups, there will be three copies of the PST.

In a high redundancy disk group there will be at least three copies of the PST. If there are four failgroups, there will be four PST copies, and if there are five or more failgroups there will be five copies of the PST.

Let's have a look. Note that in each example, the disk group is created with five disks.

External redundancy disk group.

SQL> CREATE DISKGROUP DG1 EXTERNAL REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9';

Diskgroup created.

ASM alert log:

Sat Aug 31 20:44:59 2013
SQL> CREATE DISKGROUP DG1 EXTERNAL REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9'
Sat Aug 31 20:44:59 2013
NOTE: Assigning number (2,0) to disk (/dev/sdc5)
NOTE: Assigning number (2,1) to disk (/dev/sdc6)
NOTE: Assigning number (2,2) to disk (/dev/sdc7)
NOTE: Assigning number (2,3) to disk (/dev/sdc8)
NOTE: Assigning number (2,4) to disk (/dev/sdc9)
...
NOTE: initiating PST update: grp = 2
Sat Aug 31 20:45:00 2013
GMON updating group 2 at 50 for pid 22, osid 9873
NOTE: group DG1: initial PST location: disk 0000 (PST copy 0)
Sat Aug 31 20:45:00 2013
NOTE: PST update grp = 2 completed successfully
...

We see that ASM creates only one copy of the PST.

Normal redundancy disk group

SQL> drop diskgroup DG1;

Diskgroup dropped.

SQL> CREATE DISKGROUP DG1 NORMAL REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9';

Diskgroup created.

ASM alert log

Sat Aug 31 20:49:28 2013
SQL> CREATE DISKGROUP DG1 NORMAL REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9'
Sat Aug 31 20:49:28 2013
NOTE: Assigning number (2,0) to disk (/dev/sdc5)
NOTE: Assigning number (2,1) to disk (/dev/sdc6)
NOTE: Assigning number (2,2) to disk (/dev/sdc7)
NOTE: Assigning number (2,3) to disk (/dev/sdc8)
NOTE: Assigning number (2,4) to disk (/dev/sdc9)
...
Sat Aug 31 20:49:28 2013
NOTE: group 2 PST updated.
NOTE: initiating PST update: grp = 2
Sat Aug 31 20:49:28 2013
GMON updating group 2 at 68 for pid 22, osid 9873
NOTE: group DG1: initial PST location: disk 0000 (PST copy 0)
NOTE: group DG1: initial PST location: disk 0001 (PST copy 1)
NOTE: group DG1: initial PST location: disk 0002 (PST copy 2)
Sat Aug 31 20:49:28 2013
NOTE: PST update grp = 2 completed successfully
...

We see that ASM creates three copies of the PST.

High redundancy disk group

SQL> drop diskgroup DG1;

Diskgroup dropped.

SQL> CREATE DISKGROUP DG1 HIGH REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9';

Diskgroup created.

ASM alert log

Sat Aug 31 20:51:52 2013
SQL> CREATE DISKGROUP DG1 HIGH REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9'
Sat Aug 31 20:51:52 2013
NOTE: Assigning number (2,0) to disk (/dev/sdc5)
NOTE: Assigning number (2,1) to disk (/dev/sdc6)
NOTE: Assigning number (2,2) to disk (/dev/sdc7)
NOTE: Assigning number (2,3) to disk (/dev/sdc8)
NOTE: Assigning number (2,4) to disk (/dev/sdc9)
...
Sat Aug 31 20:51:53 2013
NOTE: group 2 PST updated.
NOTE: initiating PST update: grp = 2
Sat Aug 31 20:51:53 2013
GMON updating group 2 at 77 for pid 22, osid 9873
NOTE: group DG1: initial PST location: disk 0000 (PST copy 0)
NOTE: group DG1: initial PST location: disk 0001 (PST copy 1)
NOTE: group DG1: initial PST location: disk 0002 (PST copy 2)
NOTE: group DG1: initial PST location: disk 0003 (PST copy 3)
NOTE: group DG1: initial PST location: disk 0004 (PST copy 4)
Sat Aug 31 20:51:53 2013
NOTE: PST update grp = 2 completed successfully
...

We see that ASM creates five copies of the PST.
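
The PST locations can also be verified directly on disk with kfed, by checking the block type in allocation unit 1. On a disk that holds the PST data, block 0 of AU1 should be of type KFBTYP_PST_META, while a disk without the PST data would show KFBTYP_PST_NONE (a sketch, using one of the disks from the example above):

$ kfed read /dev/sdc5 aun=1 blkn=0 | grep type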

PST relocation

The PST would be relocated in the following cases:
  • The disk with the PST is not available (on ASM startup)
  • The disk goes offline
  • There was an I/O error while reading/writing to/from the PST
  • Disk is dropped gracefully
In all cases the PST would be relocated to another disk in the same failgroup (if a disk is available in the same failure group) or to another failgroup (that doesn't already contain a copy of the PST).

Let's have a look.

SQL> drop diskgroup DG1;

Diskgroup dropped.

SQL> CREATE DISKGROUP DG1 NORMAL REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8';

Diskgroup created.

ASM alert log shows the PST copies are on disks 0, 1 and 2:

NOTE: group DG1: initial PST location: disk 0000 (PST copy 0)
NOTE: group DG1: initial PST location: disk 0001 (PST copy 1)
NOTE: group DG1: initial PST location: disk 0002 (PST copy 2)

Let's drop disk 0:

SQL> select disk_number, name, path from v$asm_disk_stat
where group_number = (select group_number from v$asm_diskgroup_stat where name='DG1');

DISK_NUMBER NAME                           PATH
----------- ------------------------------ ----------------
          3 DG1_0003                       /dev/sdc8
          2 DG1_0002                       /dev/sdc7
          1 DG1_0001                       /dev/sdc6
          0 DG1_0000                       /dev/sdc5

SQL> alter diskgroup DG1 drop disk DG1_0000;

Diskgroup altered.

ASM alert log

Sat Aug 31 21:04:29 2013
SQL> alter diskgroup DG1 drop disk DG1_0000
...
NOTE: initiating PST update: grp 2 (DG1), dsk = 0/0xe9687ff6, mask = 0x6a, op = clear
Sat Aug 31 21:04:37 2013
GMON updating disk modes for group 2 at 96 for pid 24, osid 16502
NOTE: group DG1: updated PST location: disk 0001 (PST copy 0)
NOTE: group DG1: updated PST location: disk 0002 (PST copy 1)
NOTE: group DG1: updated PST location: disk 0003 (PST copy 2)
...

We see that the PST copy from disk 0 was moved to disk 3.

Disk Partners

A disk partnership is a symmetric relationship between two disks in a high or normal redundancy disk group. There is no disk partnership in an external redundancy disk group. For a discussion on this topic, please see the post How many partners.

PST Availability

The PST has to be available before the rest of ASM metadata. When the disk group mount is requested, the GMON process (on the instance requesting a mount) reads all disks in the disk group to find and verify all available PST copies. Once it verifies that there are enough PSTs for a quorum, it mounts the disk group. From that point on, the PST is available in the ASM instance cache, stored in the GMON PGA and protected by an exclusive lock on the PT.n.0 enqueue.

As other ASM instances in the same cluster come online, they cache the PST in their GMON PGA with a shared PT.n.0 enqueue.

Only the GMON (the CKPT in 10gR1) that holds an exclusive lock on the PT enqueue can update the PST information on disks.

PST (GMON) tracing

The GMON trace file will log the PST info every time a disk group mount is attempted. Note that I said attempted, not mounted, as the GMON will log the information regardless of whether the mount is successful or not. This information may be valuable to Oracle Support in diagnosing disk group mount failures.

This would be a typical information logged in the GMON trace file on a disk group mount:

=============== PST ====================
grpNum:    2
grpTyp:    2
state:     1
callCnt:   103
bforce:    0x0
(lockvalue) valid=1 ver=0.0 ndisks=3 flags=0x3 from inst=0 (I am 1) last=0
--------------- HDR --------------------
next:    7
last:    7
pst count:       3
pst locations:   1  2  3
incarn:          4
dta size:        4
version:         0
ASM version:     168820736 = 10.1.0.0.0
contenttype:     0
--------------- LOC MAP ----------------
0: dirty 0       cur_loc: 0      stable_loc: 0
1: dirty 0       cur_loc: 0      stable_loc: 0
--------------- DTA --------------------
1: sts v v(rw) p(rw) a(x) d(x) fg# = 0 addTs = 0 parts: 2 (amp) 3 (amp)
2: sts v v(rw) p(rw) a(x) d(x) fg# = 0 addTs = 0 parts: 1 (amp) 3 (amp)
3: sts v v(rw) p(rw) a(x) d(x) fg# = 0 addTs = 0 parts: 1 (amp) 2 (amp)
...

The section marked === PST === tells us the group number (grpNum), type (grpTyp) and state. The section marked --- HDR --- shows the number of PST copies (pst count) and the disk numbers that have those copies (pst locations). The section marked --- DTA --- shows the actual state of the disks with the PST.

Conclusion

The Partnership and Status Table contains the information about all ASM disks in a disk group – disk number, disk status, partner disk number, heartbeat info and the failgroup info (11g and later).

Allocation unit number 1 on every ASM disk is reserved for the PST, but only some disks will have the PST data. As the PST is a valuable ASM metadata, it is mirrored three times in a normal redundancy disk group and five times in a high redundancy disk group - provided there are enough failgroups of course.



ASM metadata blocks


An ASM instance manages the metadata needed to make ASM files available to Oracle databases and other ASM clients. ASM metadata is stored in disk groups and organised in metadata structures. These metadata structures consist of one or more ASM metadata blocks. For example, the ASM disk header consists of a single ASM metadata block. Other structures, like the Partnership and Status Table, consist of exactly one allocation unit (AU). Some ASM metadata, like the File Directory, can span multiple AUs and does not have a predefined size; in fact, the File Directory will grow as needed and is managed like any other ASM file.

ASM metadata block types

The following are the ASM metadata block types:
  • KFBTYP_DISKHEAD - The ASM disk header - the very first block in every ASM disk. A copy of this block will be in the second last Partnership and Status Table (PST) block (in ASM version 11.1.0.7 and later). The copy of this block will also be in the very first block in Allocation Unit 11, for disk groups with COMPATIBLE.ASM=12.1 or higher.
  • KFBTYP_FREESPC - The Free Space Table block.
  • KFBTYP_ALLOCTBL - The Allocation Table block.
  • KFBTYP_PST_META - The Partnership and Status Table (PST) block. The PST blocks 0 and 1 will be of this type.
  • KFBTYP_PST_DTA - The PST blocks with the actual PST data.
  • KFBTYP_PST_NONE - The PST block with no PST data. Remember that Allocation Unit 1 (AU1) on every disk is reserved for the PST, but only some disks will have the PST data.
  • KFBTYP_HBEAT - The heartbeat block, in the PST.
  • KFBTYP_FILEDIR - The File Directory block.
  • KFBTYP_INDIRECT - The Indirect File Directory block, containing a pointer to another file directory block.
  • KFBTYP_LISTHEAD - The Disk Directory block. The very first block in the ASM disk directory. The field kfdhdb.f1b1locn in the ASM disk header will point to the allocation unit whose block 0 will be of this type.
  • KFBTYP_DISKDIR - The rest of the blocks in the Disk Directory will be of this type.
  • KFBTYP_ACDC - The Active Change Directory (ACD) block. The very first block of the ACD will be of this type.
  • KFBTYP_CHNGDIR - The blocks with the actual ACD data.
  • KFBTYP_COD_BGO - The Continuing Operations Directory (COD) block for background operations data.
  • KFBTYP_COD_RBO - The COD block that marks the rollback operations data.
  • KFBTYP_COD_DATA - The COD block with the actual rollback operations data.
  • KFBTYP_TMPLTDIR - The Template Directory block.
  • KFBTYP_ALIASDIR - The Alias Directory block.
  • KFBTYP_SR - The Staleness Registry block.
  • KFBTYP_STALEDIR - The Staleness Directory block.
  • KFBTYP_VOLUMEDIR - The ADVM Volume Directory block.
  • KFBTYP_ATTRDIR - The Attributes Directory block.
  • KFBTYP_USERDIR - The User Directory block.
  • KFBTYP_GROUPDIR - The User Group Directory block.
  • KFBTYP_USEDSPC - The Disk Used Space Directory block.
  • KFBTYP_ASMSPFALS - The ASM spfile alias block.
  • KFBTYP_PASWDDIR - The ASM Password Directory block.
  • KFBTYP_INVALID - Not an ASM metadata block.
Note that KFBTYP_INVALID is not an actual block type stored in an ASM metadata block. Instead, ASM returns this type when it encounters a block whose type is not one of the valid ASM metadata block types. For example, if the ASM disk header is corrupt, say zeroed out, ASM will report it as KFBTYP_INVALID. We will also see the same when reading such a block with the kfed tool.
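
For example, this is how one might spot a damaged disk header with kfed - the device path is hypothetical, and on a healthy disk the same command would report KFBTYP_DISKHEAD:

$ kfed read /dev/sdc1 | grep kfbh.type
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID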

ASM metadata block

The default ASM metadata block size is 4096 bytes. The block size will be specified in the ASM disk header field kfdhdb.blksize. Note that the ASM metadata block size has nothing to do with the database block size.

ASM metadata block header

The first 32 bytes of an ASM metadata block contain the block header (not to be confused with the ASM disk header). The block header has the following information:
  • kfbh.endian - Platform endianness.
  • kfbh.hard - H.A.R.D. (Hardware Assisted Resilient Data) signature.
  • kfbh.type - Block type.
  • kfbh.datfmt - Block data format.
  • kfbh.block.blk - Location (block number). 
  • kfbh.block.obj - Data type held in this block.
  • kfbh.check - Block checksum.
  • kfbh.fcn.base - Block change control number (base).
  • kfbh.fcn.wrap - Block change control number (wrap).
The FCN is the ASM equivalent of database SCN.

The rest of the contents of an ASM metadata block is specific to the block type. In other words, an ASM disk header block will have the disk header specific data - disk number, disk name, disk group name, etc. - while a File Directory block will have the extent location data for a file, and so on.
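
As a quick illustration (the device path is an assumption for my environment), reading the disk header block with kfed shows the common kfbh fields followed by the kfdhdb fields that are specific to the disk header:

$ kfed read /dev/sdc1 aun=0 blkn=0 | egrep "^kfbh|^kfdhdb" | head

Reading a block from another metadata structure would show the same kfbh header, followed by that structure's own fields instead.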

Conclusion

An ASM instance manages ASM metadata blocks. It creates them, updates them, calculates and updates the checksums on writes, verifies the checksums on reads, exchanges the blocks with other instances, and so on. ASM metadata structures consist of one or more ASM metadata blocks. A tool like kfed can be used to read and modify ASM metadata blocks.

Tell me about your ASM

When diagnosing ASM issues, it helps to know a bit about the setup - disk group names and types, the state of disks, ASM instance initialisation parameters and whether any rebalance operations are in progress. In those cases I usually ask for an HTML report, produced by running a SQL script against one of the ASM instances. This post presents that script, with comments on the output.

The script

First, here is the script, which may be saved as asm_report.sql:

spool /tmp/ASM_report.html
set markup html on
set echo off
set feedback off
set pages 10000
break on INST_ID on GROUP_NUMBER
prompt ASM report
select to_char(SYSDATE, 'DD-Mon-YYYY HH24:MI:SS') "Time" from dual;
prompt Version
select * from V$VERSION where BANNER like '%Database%' order by 1;
prompt Cluster wide operations
select * from GV$ASM_OPERATION order by 1;
prompt
prompt Disk groups, including the dismounted disk groups
select * from V$ASM_DISKGROUP order by 1, 2, 3;
prompt All disks, including the candidate disks
select GROUP_NUMBER, DISK_NUMBER, FAILGROUP, NAME, LABEL, PATH, MOUNT_STATUS, HEADER_STATUS, STATE, OS_MB, TOTAL_MB, FREE_MB, CREATE_DATE, MOUNT_DATE, SECTOR_SIZE, VOTING_FILE, FAILGROUP_TYPE
from V$ASM_DISK
where MODE_STATUS='ONLINE'
order by 1, 2;
prompt Offline disks
select GROUP_NUMBER, DISK_NUMBER, FAILGROUP, NAME, MOUNT_STATUS, HEADER_STATUS, STATE, REPAIR_TIMER
from V$ASM_DISK
where MODE_STATUS='OFFLINE'
order by 1, 2;
prompt Disk group attributes
select GROUP_NUMBER, NAME, VALUE from V$ASM_ATTRIBUTE where NAME not like 'template%' order by 1;
prompt Connected clients
select * from V$ASM_CLIENT order by 1, 2;
prompt Non-default ASM specific initialisation parameters, including the hidden ones
select KSPPINM "Parameter", KSPFTCTXVL "Value"
from X$KSPPI a, X$KSPPCV2 b
where a.INDX + 1 = KSPFTCTXPN and (KSPPINM like '%asm%' or KSPPINM like '%balance%' or KSPPINM like '%auto_manage%') and kspftctxdf = 'FALSE'
order by 1 desc;
prompt Memory, cluster and instance specific initialisation parameters
select NAME "Parameter", VALUE "Value", ISDEFAULT "Default"
from V$PARAMETER
where NAME like '%target%' or NAME like '%pool%' or NAME like 'cluster%' or NAME like 'instance%'
order by 1;
prompt Disk group imbalance
select g.NAME "Diskgroup",
100*(max((d.TOTAL_MB-d.FREE_MB + (128*g.ALLOCATION_UNIT_SIZE/1048576))/(d.TOTAL_MB + (128*g.ALLOCATION_UNIT_SIZE/1048576)))-min((d.TOTAL_MB-d.FREE_MB + (128*g.ALLOCATION_UNIT_SIZE/1048576))/(d.TOTAL_MB + (128*g.ALLOCATION_UNIT_SIZE/1048576))))/max((d.TOTAL_MB-d.FREE_MB + (128*g.ALLOCATION_UNIT_SIZE/1048576))/(d.TOTAL_MB + (128*g.ALLOCATION_UNIT_SIZE/1048576))) "Imbalance",
count(*) "Disk count",
g.TYPE "Type"
from V$ASM_DISK_STAT d , V$ASM_DISKGROUP_STAT g
where d.GROUP_NUMBER = g.GROUP_NUMBER and d.STATE = 'NORMAL' and d.MOUNT_STATUS = 'CACHED'
group by g.NAME, g.TYPE;
prompt End of ASM report
set markup html off
set echo on
set feedback on
exit

To produce the report, which will be saved as /tmp/ASM_report.html, run the script as the OS user that owns the Grid Infrastructure home (usually grid or oracle), against an ASM instance (say +ASM1), like this:

$ sqlplus -S / as sysasm @asm_report.sql

To save the output in a different location or under a different name, just modify the spool command (line 1 in the script).
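
If the environment is not already set for the ASM instance, the full invocation might look like this - the Grid Infrastructure home path and the SID are assumptions for my environment:

$ export ORACLE_SID=+ASM1
$ export ORACLE_HOME=/u01/app/12.1.0/grid
$ $ORACLE_HOME/bin/sqlplus -S / as sysasm @asm_report.sql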

The report

The report first shows the time of the report and the ASM version.

It then shows any ASM operations in progress. In my example there was a rebalance running in ASM instance 1, with the resync and rebalance phases already completed and the compacting as the only outstanding operation.


Next we see the information about all disk groups, including the dismounted disk groups. This is then followed by the info about disks, again with the note that this includes the candidate disks.

I have separated the info about the offline disks, as this may be of interest when dealing with disk issues. That section shows the group and disk numbers, failgroup, disk name, mount and header status, state and the repair timer.



Next are the disk group attributes, with the note that this will be displayed only for ASM version 11.1 and later, as we did not have the disk group attributes in earlier versions.

This is followed by the list of connected clients, usually database instances served by that ASM instance.

The section with ASM initialisation parameters includes hidden and some Exadata specific (_auto_manage) parameters.

I have also separated the memory, cluster and instance specific initialisation parameters as they are often of special interest.

The last section shows the disk group imbalance report.

Sample reports

Here is a sample report from an Exadata system: ASM_report_Exa.html.

And here is a sample report from a version 12c Oracle Restart system: ASM_report_12c.html.

Conclusion

While I use this report for a quick overview of ASM, it can also serve as 'backup' information about your ASM setup. You are welcome to modify the script to produce a report that suits your needs. Please let me know if you find any issues with the script or if you have suggestions for improvements.

Acknowledgments

The bulk of the script is based on My Oracle Support (MOS) Doc ID 470211.1, by Oracle Support engineer Esteban D. Bernal.

The imbalance SQL is based on the Reporting Disk Imbalances script from Oracle Press book Oracle Automatic Storage Management, Under-the-Hood & Practical Deployment Guide, by Nitin Vengurlekar, Murali Vallath and Rich Long.

The ASM password directory

Password file authentication for Oracle Database or ASM can work for both local and remote connections. In Oracle version 12c, the password files can reside in an ASM disk group. The ASM metadata structure for managing the passwords is the ASM Password Directory - ASM metadata file number 13.

Note that the password files are accessible only after the disk group is mounted. One implication of this is that no remote SYSASM access to ASM and no remote SYSDBA access to the database is possible until the disk group with the password file is mounted.

The password file

The disk group based password file is managed by ASMCMD commands, the ORAPWD tool and SRVCTL commands. The password file can be created with ORAPWD and ASMCA (at the time ASM is configured). All other password file manipulations are performed with ASMCMD or SRVCTL commands.

The COMPATIBLE.ASM disk group attribute must be set to at least 12.1 for the disk group where the password file is to be located. The SYSASM privilege is required to manage the ASM password file and the SYSDBA privilege is required to manage the database password file.

Let's create the ASM password file in disk group DATA.

First make sure the COMPATIBLE.ASM attribute is set to the minimum required value:

$ asmcmd lsattr -G DATA -l compatible.asm
Name            Value
compatible.asm  12.1.0.0.0
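
If the attribute were lower, it could be raised first - a sketch, assuming disk group DATA:

$ asmcmd setattr -G DATA compatible.asm 12.1.0.0.0

Keep in mind that COMPATIBLE.ASM can only be advanced, never lowered.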

Create the ASM password file:

$ orapwd file='+DATA/orapwasm' asm=y
Enter password for SYS: *******
$

Get the ASM password file name:

$ asmcmd pwget --asm
+DATA/orapwasm

And finally, find the ASM password file location and fully qualified name:

$ asmcmd find +DATA "*" --type password
+DATA/ASM/PASSWORD/pwdasm.256.837972683
+DATA/orapwasm

From this we see that +DATA/orapwasm is an alias for the actual file, which lives under the special +[DISK GROUP NAME]/ASM/PASSWORD directory.
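
For completeness, a database password file can be placed in a disk group in much the same way. This is just a sketch - the database unique name DB and the file name are assumptions, not something configured on this system:

$ orapwd file='+DATA/orapwdb' dbuniquename=DB
Enter password for SYS: *******

$ srvctl config database -db DB | grep -i password

The last command should show the password file location registered for the database resource.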

The ASM password directory

The ASM metadata structure for managing the disk group based passwords is the ASM Password Directory - ASM metadata file number 13. Note that the password file is also managed by the ASM File Directory, like any other ASM based file. I guess this redundancy just highlights the importance of the password file.

Let's locate the ASM Password Directory. As that is file number 13, we can look it up in the ASM File Directory. That means we first need to locate the ASM File Directory itself. The pointer to the first AU of the ASM File Directory is in the disk header of disk 0, in the field kfdhdb.f1b1locn:

First match the disk numbers to disk paths for disk group DATA:

$ asmcmd lsdsk -p -G DATA | cut -c12-21,78-88
Disk_Num  Path
      0  /dev/sdc1
      1  /dev/sdd1
      2  /dev/sde1
      3  /dev/sdf1

Now get the starting point of the file directory:

$ kfed read /dev/sdc1 | grep f1b1locn
kfdhdb.f1b1locn:                     10 ; 0x0d4: 0x0000000a

This is telling us that the file directory starts at AU 10 on that disk. Now look up block 13 in AU 10 - that will be the directory entry for ASM file 13, i.e. the ASM Password Directory.

$ kfed read /dev/sdc1 aun=10 blkn=13 | egrep "au|disk" | head
kfffde[0].xptr.au:                   47 ; 0x4a0: 0x0000002f
kfffde[0].xptr.disk:                  2 ; 0x4a4: 0x0002
kfffde[1].xptr.au:                   45 ; 0x4a8: 0x0000002d
kfffde[1].xptr.disk:                  1 ; 0x4ac: 0x0001
kfffde[2].xptr.au:                   46 ; 0x4b0: 0x0000002e
kfffde[2].xptr.disk:                  3 ; 0x4b4: 0x0003
kfffde[3].xptr.au:           4294967295 ; 0x4b8: 0xffffffff
kfffde[3].xptr.disk:              65535 ; 0x4bc: 0xffff
...

The output is telling us that the ASM Password Directory is in AU 47 on disk 2 (with copies in AU 45 on disk 1, and AU 46 on disk 3). Note that the ASM Password Directory is triple mirrored, even in a normal redundancy disk group.

Now look at AU 47 on disk 2:

$ kfed read /dev/sde1 aun=47 blkn=1 | more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                           29 ; 0x002: KFBTYP_PASWDDIR
...
kfzpdb.block.incarn:                  3 ; 0x000: A=1 NUMM=0x1
kfzpdb.block.frlist.number:  4294967295 ; 0x004: 0xffffffff
kfzpdb.block.frlist.incarn:           0 ; 0x008: A=0 NUMM=0x0
kfzpdb.next.number:                  15 ; 0x00c: 0x0000000f
kfzpdb.next.incarn:                   3 ; 0x010: A=1 NUMM=0x1
kfzpdb.flags:                         0 ; 0x014: 0x00000000
kfzpdb.file:                        256 ; 0x018: 0x00000100
kfzpdb.finc:                  837972683 ; 0x01c: 0x31f272cb
kfzpdb.bcount:                       15 ; 0x020: 0x0000000f
kfzpdb.size:                        512 ; 0x024: 0x00000200
...

The ASM metadata block type is KFBTYP_PASWDDIR, i.e. the ASM Password Directory. We see that it points to ASM file number 256 (kfzpdb.file=256). From the earlier asmcmd find command we already know the actual password file is ASM file 256 in disk group DATA:

$ asmcmd ls -l +DATA/ASM/PASSWORD
Type      Redund  Striped  Time             Sys  Name
PASSWORD  HIGH    COARSE   JAN 27 18:00:00  Y    pwdasm.256.837972683

Again, note that the ASM password file is triple mirrored (Redund=HIGH) even in a normal redundancy disk group.
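
On the metadata side, we can also cross-check the mirror copies of the ASM Password Directory noted earlier (the device paths follow the lsdsk mapping above) - the copy in AU 45 on disk 1 shows the same block type:

$ kfed read /dev/sdd1 aun=45 blkn=1 | grep kfbh.type
kfbh.type:                           29 ; 0x002: KFBTYP_PASWDDIR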

Conclusion

Starting with ASM version 12c we can store ASM and database password files in an ASM disk group. The ASM password file can be created at the time of Grid Infrastructure installation or later with the ORAPWD command. The disk group based password files are managed with ASMCMD, ORAPWD and SRVCTL commands.

Error count

This tiny Perl script can be used to report the error type and count for ASM disks in engineered systems, including Exadata. In those systems, ASM uses griddisks, which are created from celldisks. The celldisks are in turn created from the physical disks.
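
To see that hierarchy on a storage cell, one can list the relevant attributes with CellCLI - a hedged example, as the exact attribute names may vary between storage software versions:

# cellcli -e list griddisk attributes name,cellDisk,size
# cellcli -e list celldisk attributes name,physicalDisk,size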

errorCount.pl script

To quickly check for errors on any of those disks, we can use the errorCount.pl Perl script. This is the complete script with comments:

#!/usr/bin/perl
# Process lines from standard input or a file(s)
while (<>) {
 # Strip whitespace
 s/\s+//g;
 # Get a disk name
 if ( /name/ ) {
  $name = $_;
 }
 # Get error type for non-zero counts
 elsif ( /err.*?Count:[1-9]/ ) {
  $errTypeCount = $_;
  # Print the disk name and the error type/count
  print "$name, $errTypeCount\n";
 }
}

Stripped to the bare bones, errorCount.pl becomes:

#!/usr/bin/perl
while (<>) {
 s/\s+//g;
 if ( /name/ ) { $name = $_ }
 elsif ( /err.*?Count:[1-9]/ ) { print "$name, $_\n" }
}

Usage

Use the script with the output of the cellcli -e list physicaldisk|celldisk|griddisk detail command, on an Exadata storage cell. For example:

# cellcli -e list griddisk detail | errorCount.pl
name:DATA_CD_00_exacell03, errorCount:342
name:RECO_CD_00_exacell03, errorCount:728
name:RECO_CD_06_exacell03, errorCount:8
#

Use the script with the output of a dcli command, which is normally run on a database server. For example:

# dcli -g cell_group -l root cellcli -e list celldisk detail | errorCount.pl
exacell01:name:CD_03_exacell01, exacell01:errorCount:80
exacell01:name:CD_06_exacell01, exacell01:errorCount:64
#

The above shows the errors on cell disks 3 and 6 on storage cell 1. Let's have a closer look at those cell disks:

# dcli -c exacell01 -l root cellcli -e list celldisk CD_03_exacell01,CD_06_exacell01 detail
exacell01: name:                       CD_03_exacell01
exacell01: comment:
exacell01: creationTime:               2015-09-22T10:59:08+10:00
exacell01: deviceName:                 /dev/sdd
exacell01: devicePartition:            /dev/sdd
exacell01: diskType:                   HardDisk
exacell01: errorCount:                 80
exacell01: freeSpace:                  0
exacell01: id:                         bb74cae4-bb47-4d95-b7ee-e3cc5bdf780f
exacell01: interleaving:               none
exacell01: lun:                        0_3
exacell01: physicalDisk:               E1D9RY
exacell01: raidLevel:                  0
exacell01: size:                       557.859375G
exacell01: status:                     normal
exacell01:
exacell01: name:                       CD_06_exacell01
exacell01: comment:
exacell01: creationTime:               2015-09-22T10:59:08+10:00
exacell01: deviceName:                 /dev/sdg
exacell01: devicePartition:            /dev/sdg
exacell01: diskType:                   HardDisk
exacell01: errorCount:                 64
exacell01: freeSpace:                  0
exacell01: id:                         404565b2-1be7-4171-8678-9991157156da
exacell01: interleaving:               none
exacell01: lun:                        0_6
exacell01: physicalDisk:               E1EB4J
exacell01: raidLevel:                  0
exacell01: size:                       557.859375G
exacell01: status:                     normal
#

Use the script with the sundiag [physicaldisk|celldisk|griddisk]-detail.out files. For example, on a celldisk detailed report:

# errorCount.pl celldisk-detail.out
name:CD_00_exacell03, errorCount:1070
name:CD_04_exacell03, errorCount:4200
name:CD_06_exacell03, errorCount:8
name:FD_02_exacell03, errorCount:5300
#

Or on a physical disk detailed report:

# errorCount.pl physicaldisk-detail.out
name:20:0, errMediaCount:1000
name:20:5, errMediaCount:2000
name:FLASH_1_0, errHardWriteCount:3000
name:FLASH_1_0, errMediaCount:4000
name:FLASH_1_0, errSeekCount:5000
name:FLASH_1_1, errOtherCount:6000
name:FLASH_4_0, errHardReadCount:7000
#

Yes, I made the numbers up to make the output interesting.

The diamond operator (<>) in the while loop lets us process multiple files, like this:

# errorCount.pl celldisk-detail.out physicaldisk-detail.out
...

But a quicker way to do the above would be:

# cat *detail.out | errorCount.pl
name:CD_03_dmq1cel04, errorCount:2
name:CD_07_dmq1cel04, errorCount:2
name:CD_09_dmq1cel04, errorCount:1
name:CD_11_dmq1cel04, errorCount:1
name:DATA_CD_03_dmq1cel04, errorCount:2
name:DATA_CD_07_dmq1cel04, errorCount:2
name:DATA_CD_09_dmq1cel04, errorCount:1
name:DATA_CD_11_dmq1cel04, errorCount:1
#

Check any count

The script can be easily modified to report on any disk attribute with a non-zero value. For example, to check whether there is any free space on a cell disk, we can use the modified script freeSpace.pl:

#!/usr/bin/perl
while (<>) {
 s/\s+//g;
 if ( /name/ ) { $name = $_ }
 elsif ( /freeSpace:[1-9]/ ) { print "$name, $_\n" }
}

Like this:

# dcli -g cell_group -l root cellcli -e list celldisk detail | freeSpace.pl
exacell01:name:CD_00_exacell01, exacell01:freeSpace:528.6875G
#
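
Taking this one step further, here is a sketch of a generalised version that takes the attribute name as an argument. I call it countAttr.sh - the name and the awk-based approach are my own, not part of any Oracle tool:

#!/bin/bash
# countAttr.sh - print disks with a non-zero value of the given attribute.
# Usage: cellcli -e list celldisk detail | ./countAttr.sh errorCount
attr=${1:-errorCount}
awk -v attr="$attr" '
 { gsub(/[[:space:]]/, "") }                  # strip whitespace, like s/\s+//g
 /name/ { name = $0; next }                   # remember the current disk name
 $0 ~ (attr ":[1-9]") { print name ", " $0 }  # print only non-zero values
'

For example:

# dcli -g cell_group -l root cellcli -e list celldisk detail | ./countAttr.sh freeSpace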

Conclusion

In engineered systems, including Exadata, ASM uses griddisks, which are created from celldisks; the celldisks are in turn created from the physical disks. To quickly check for errors on any of those disks, we can use the errorCount.pl Perl script, either directly on the cell or via the dcli utility that we run on a database server.

ASM data scrubbing

According to Wikipedia, data scrubbing is “an error correction technique that uses a background task to periodically inspect main memory or storage for errors, and then correct detected errors using redundant data in form of different checksums or copies of data. Data scrubbing reduces the likelihood that single correctable errors will accumulate, leading to reduced risks of uncorrectable errors.”

Oracle documentation says that ASM disk scrubbing “improves availability and reliability by searching for data that may be less likely to be read. Disk scrubbing checks logical data corruptions and repairs them automatically in normal and high redundancy disk groups. The scrubbing process repairs logical corruptions using the mirror disks. Disk scrubbing can be combined with disk group rebalance to reduce I/O resources.” This feature is available in ASM version 12c.

Setup

First let's look at my setup. I have a single normal redundancy disk group DATA, with ASM filter driver (AFD) disks.

[grid@dbserver]$ sqlplus / as sysasm
SQL*Plus: Release 12.1.0.2.0 Production on Tue Dec 8 14:08:22 2015
...

SQL> select NAME, TYPE, STATE from v$asm_diskgroup_stat;

NAME         TYPE   STATE
------------ ------ -----------
DATA         NORMAL MOUNTED

SQL> select DISK_NUMBER, PATH from v$asm_disk_stat;

DISK_NUMBER PATH
----------- ----------------
        0 AFD:ASMDISK01
        2 AFD:ASMDISK03
        4 AFD:ASMDISK05
        1 AFD:ASMDISK02
        6 AFD:ASMDISK07
        5 AFD:ASMDISK06
        3 AFD:ASMDISK04

7 rows selected.

SQL>

And I have some datafiles in that disk group:

[grid@dbserver]$ asmcmd ls +DATA/BR/DATAFILE
SYSTEM.261.887497785
USERS.262.8874978313
UNDOTBS1.269.887497831
SYSAUX.270.887497739
T1.305.897911209
T2.306.897911479
T3.307.897911659

Scrubbing a file

We can scrub a disk group, an individual disk or an individual file. In this post I will demonstrate the feature by scrubbing a single file.
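
For reference, scrubbing the whole disk group or a single disk looks like this - a hedged sketch, as the disk name DATA_0001 is an assumption for my environment:

SQL> alter diskgroup DATA scrub power low wait;

SQL> alter diskgroup DATA scrub disk DATA_0001 repair power auto wait;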

To scrub a file, we connect to an ASM instance and run the ALTER DISKGROUP SCRUB FILE command:

SQL> alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' repair power high wait;
...

The REPAIR keyword is optional; without it, a corruption would be reported but not repaired.

In the ASM alert log we see the time-stamped command and, once the scrubbing finishes, the outcome, which looks something like this:

Tue Dec 08 11:55:56 2015
SQL> alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' repair power high wait
Tue Dec 08 11:55:56 2015
NOTE: Waiting for scrubbing to finish
Tue Dec 08 11:56:43 2015
NOTE: Scrubbing finished
Tue Dec 08 11:56:43 2015
SUCCESS: alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' repair power high wait

There were no corruptions in my file, so none were reported.

While the scrubbing was running, I saw two ASM processes doing the actual work:

[grid@dbserver]$ ps -ef | grep asm_sc
grid     17902     1  0 11:27 ?        00:00:00 asm_scrb_+ASM
grid     24365     1  0 11:49 ?        00:00:01 asm_scc0_+ASM
...

SCRB - ASM disk scrubbing master process that coordinates disk scrubbing operations.

SCCn - ASM disk scrubbing slave check process that performs the checking operations. The possible processes are SCC0-SCC9.

We would see additional processes during a repair operation:

SCRn - ASM disk scrubbing slave repair process that performs the repair operations. The possible processes are SCR0-SCR9.

SCVn - ASM disk scrubbing slave verify process that performs the verifying operations. The possible processes are SCV0-SCV9.

Corrupted block found

To make this interesting, let's corrupt one block - say block 200 - in that datafile. First I used the find_block.pl script to locate both copies of the block:

[grid@dbserver ]$ $ORACLE_HOME/perl/bin/perl find_block.pl +DATA/BR/DATAFILE/t3.307.897911659 200
dd if=/dev/sdo1 bs=8192 count=1 skip=1460552 of=block_200.dd
dd if=/dev/sdd1 bs=8192 count=1 skip=1462088 of=block_200.dd

Then I used those dd commands to extract the blocks.

[root@dbserver]# dd if=/dev/sdo1 bs=8192 count=1 skip=1460552 of=block_200_primary.dd
[root@dbserver]# dd if=/dev/sdd1 bs=8192 count=1 skip=1462088 of=block_200_mirror.dd

Using the diff command, I verified that both copies are exactly the same.

[root@dbserver]# diff block_200_primary.dd block_200_mirror.dd
[root@dbserver]#

Let's now corrupt the mirror block, which is not trivial with the ASM filter driver.

First create a text file to use as the corruption. Sure, I could have just used /dev/zero for that.

[root@dbserver]# od -c block_200_mirror.dd > block_200_corrupt.txt

Shut down the database and ASM, then unload the ASM filter driver; otherwise I cannot dd into that disk.

[root@dbserver]# modprobe -r oracleafd
[root@dbserver]# lsmod | grep oracle
oracleacfs           3308260  0
oracleadvm            508030  0
oracleoks             506741  2 oracleacfs,oracleadvm

Now corrupt the mirror copy of block 200.

[root@dbserver]# dd if=block_200_corrupt.txt of=/dev/sdd1 bs=8192 count=1 seek=1462088
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.00160027 s, 5.1 MB/s

Confirm the mirror block is corrupt by reading the content again.

[root@dbserver]# dd if=/dev/sdd1 bs=8192 count=1 skip=1462088 of=block_200_corr.dd
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.00152298 s, 5.4 MB/s

[root@dbserver]# diff block_200_primary.dd  block_200_corr.dd
Binary files block_200_primary.dd and block_200_corr.dd differ
[root@dbserver]#

As we can see, the content of the mirror block is now different.

Scrub a file

Load the ASM filter driver back.

[root@dbserver]# insmod /lib/modules/.../oracleafd.ko

And restart the ASM instance.

We can now scrub the datafile.

SQL> alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' wait;
...

The ASM alert log shows that a corrupted block was found.

Tue Dec 08 13:25:48 2015
SQL> alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' wait
Starting background process SCRB
Tue Dec 08 13:25:48 2015
SCRB started with pid=25, OS id=4585
Tue Dec 08 13:25:48 2015
NOTE: Waiting for scrubbing to finish
Tue Dec 08 13:25:49 2015
NOTE: Corrupted block 200 found in file +DATA/BR/DATAFILE/t3.307.897911659
Tue Dec 08 13:25:50 2015
NOTE: Corrupted block 200 found in file +DATA/BR/DATAFILE/t3.307.897911659
Tue Dec 08 13:26:39 2015
NOTE: Scrubbing finished
Tue Dec 08 13:26:39 2015
SUCCESS: alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' wait

In the background dump destination we see newly generated trace files.

[grid@dbserver]$ ls -lrt | tail
-rw-r-----. 1 grid oinstall  36887 Dec  8 13:25 +ASM_scc0_4587.trc
-rw-r-----. 1 grid oinstall  36967 Dec  8 13:25 +ASM_scv0_4592.trc
-rw-r-----. 1 grid oinstall   5088 Dec  8 13:25 +ASM_gmon_16705.trc
-rw-r-----. 1 grid oinstall  36965 Dec  8 13:25 +ASM_scv1_4599.trc
-rw-r-----. 1 grid oinstall 551218 Dec  8 13:26 alert_+ASM.log

And if we have a closer look in, say, the SCC0 trace file, we see the scrubbing report.

[grid@dbserver]$ more ./+ASM_scc0_4587.trc
Trace file /u01/app/grid/diag/asm/+asm/+ASM/trace/+ASM_scc0_4587.trc
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
...
HARD VIOLATION(S) DETECTED!
CORRUPTED BLOCK INFORMATION:
memory address: 0x7f13531fb000
block size: 8192
block number: 200 [0xc8]
block offset in this write request of 1 block(s): 0
file type: datafile (type value: 2)
...
HARD check failed at block 200

One way to get a quick scrubbing report is to grep the trace files for the relevant keywords.

[grid@dbserver]$ egrep -i "corrupted|block number|failed" *sc*trc
+ASM_scc0_4587.trc:CORRUPTED BLOCK INFORMATION:
+ASM_scc0_4587.trc:  block number: 200 [0xc8]
+ASM_scc0_4587.trc:  Magic number  Block size  Block number  Checksum  Head-tail match
+ASM_scc0_4587.trc:FAILED CHECK(S):
+ASM_scc0_4587.trc:  MAGIC NUMBER CHECK FAILED
+ASM_scc0_4587.trc:  BLOCK SIZE CHECK FAILED
+ASM_scc0_4587.trc:  BLOCK NUMBER CHECK FAILED
+ASM_scc0_4587.trc:  HEAD-TAIL MATCH CHECK FAILED
+ASM_scc0_4587.trc:HARD check failed at block 200

+ASM_scv0_4592.trc:CORRUPTED BLOCK INFORMATION:
+ASM_scv0_4592.trc:  block number: 200 [0xc8]
+ASM_scv0_4592.trc:  Magic number  Block size  Block number  Checksum  Head-tail match
+ASM_scv0_4592.trc:FAILED CHECK(S):
+ASM_scv0_4592.trc:  MAGIC NUMBER CHECK FAILED
+ASM_scv0_4592.trc:  BLOCK SIZE CHECK FAILED
+ASM_scv0_4592.trc:  BLOCK NUMBER CHECK FAILED
+ASM_scv0_4592.trc:  HEAD-TAIL MATCH CHECK FAILED
+ASM_scv0_4592.trc:HARD check failed at block 200
+ASM_scv0_4592.trc:NOTE: Corrupted block 200 found in file +DATA/BR/DATAFILE/t3.307.897911659

+ASM_scv1_4599.trc:CORRUPTED BLOCK INFORMATION:
+ASM_scv1_4599.trc:  block number: 200 [0xc8]
+ASM_scv1_4599.trc:  Magic number  Block size  Block number  Checksum  Head-tail match
+ASM_scv1_4599.trc:FAILED CHECK(S):
+ASM_scv1_4599.trc:  MAGIC NUMBER CHECK FAILED
+ASM_scv1_4599.trc:  BLOCK SIZE CHECK FAILED
+ASM_scv1_4599.trc:  BLOCK NUMBER CHECK FAILED
+ASM_scv1_4599.trc:  HEAD-TAIL MATCH CHECK FAILED
+ASM_scv1_4599.trc:HARD check failed at block 200
+ASM_scv1_4599.trc:NOTE: Corrupted block 200 found in file +DATA/BR/DATAFILE/t3.307.897911659
[grid@dbserver]$

Fix the corruption

But the corruption wasn't repaired, since I didn't ask for that. Let's fix it - sorry, scrub it.

SQL> alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' repair wait;
...

This time the ASM alert log shows the block was scrubbed.

Tue Dec 08 13:35:02 2015
SQL> alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' repair wait
Tue Dec 08 13:35:02 2015
NOTE: Waiting for scrubbing to finish
Tue Dec 08 13:35:03 2015
NOTE: Corrupted block 200 found in file +DATA/BR/DATAFILE/t3.307.897911659
Tue Dec 08 13:35:04 2015
NOTE: Scrubbing block 200 in file 307.897911659 in slave
NOTE: Successfully scrubbed block 200 in file 307.897911659
Tue Dec 08 13:35:53 2015
NOTE: Scrubbing finished
Tue Dec 08 13:35:53 2015
SUCCESS: alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' repair wait

Let's check by reading the block back and comparing it to the primary copy.

[root@dbserver ~]# dd if=/dev/sdd1 bs=8192 count=1 skip=1462088 of=block_200_after_scrub_repair.dd
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000856456 s, 9.6 MB/s

[root@dbserver]# diff block_200_primary.dd block_200_after_scrub_repair.dd
[root@dbserver]#

This shows that the block was successfully repaired.

Conclusion

ASM data scrubbing can detect and repair Oracle data blocks that are affected by media or logical corruptions. It can also correct Oracle data blocks written to an incorrect location, or overwritten by an external, non-Oracle process (as in my example).
