How to replace a failed disk in Solaris that is part of an SVM mirror

SVM = Solaris Volume Manager

Take a backup

# metastat -p >/var/tmp/metastat-p-before.txt

# metastat -t >/var/tmp/metastat-t-before.txt

# metadb -i >/var/tmp/metadb-i-before.txt

# echo | format >/var/tmp/echo-format-before.txt

# iostat -en >/var/tmp/iostat-en-before.txt
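
The same captures can be wrapped in a small helper so every file lands in one directory. This is only a sketch: BACKUP_DIR and save_output are made-up names (not SVM tools), and the default directory is an assumption.

```shell
#!/bin/sh
# Sketch only: BACKUP_DIR and save_output are illustrative names.
BACKUP_DIR=${BACKUP_DIR:-/var/tmp/svm-before}

# save_output NAME COMMAND [ARGS...]
# Runs COMMAND and stores its stdout and stderr in $BACKUP_DIR/NAME.txt.
save_output() {
    name=$1
    shift
    mkdir -p "$BACKUP_DIR" || return 1
    "$@" > "$BACKUP_DIR/$name.txt" 2>&1
}

# Intended use on the server (commented out so nothing runs by accident):
#   save_output metastat-p metastat -p
#   save_output metastat-t metastat -t
#   save_output metadb-i   metadb -i
#   save_output iostat-en  iostat -en
#   echo | save_output format format
```

Capturing stderr as well means an error message from a command ends up in the file instead of being lost.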

Identify the failed disk

# metastat

- look for Maintenance

# echo | format

- look for "unknown" etc.

# iostat -en

Also check the /var/adm/messages log for disk error/failure events

Example of metastat output with a failed disk:

# metastat

d0: Mirror
    Submirror 0: d20
      State: Okay         Tue 07 Jun 2011 06:03:48 PM SGT
    Submirror 1: d10
      State: Needs maintenance Tue 16 Dec 2014 11:35:50 PM SGT
...

d20: Submirror of d0
    State: Okay         Tue 07 Jun 2011 06:03:48 PM SGT
    Size: 10241505 blocks
    Stripe 0:
      Device     Start Dbase State        Hot Spare Time
      c1t1d0s0       0 No     Okay                    Tue 07 Jun 2011 06:03:31 PM SGT

d10: Submirror of d0
    State: Needs maintenance Tue 16 Dec 2014 11:35:50 PM SGT
    Invoke: metareplace d0 c1t0d0s0 <new device>
    Size: 10241505 blocks
    Stripe 0:
      Device     Start Dbase State        Hot Spare Time
      c1t0d0s0       0 No     Maintenance             Tue 16 Dec 2014 11:35:50 PM SGT

d3: Mirror
    Submirror 0: d23
      State: Okay         Tue 07 Jun 2011 06:04:41 PM SGT
    Submirror 1: d13
      State: Needs maintenance Tue 16 Dec 2014 11:35:15 PM SGT
...

d23: Submirror of d3
    State: Okay         Tue 07 Jun 2011 06:04:41 PM SGT
    Size: 56754405 blocks
    Stripe 0:
      Device     Start Dbase State        Hot Spare Time
      c1t1d0s3       0 No     Okay                    Tue 07 Jun 2011 06:04:22 PM SGT

d13: Submirror of d3
    State: Needs maintenance Tue 16 Dec 2014 11:35:15 PM SGT
    Invoke: metareplace d3 c1t0d0s3 <new device>
    Size: 56754405 blocks
    Stripe 0:
      Device     Start Dbase State        Hot Spare Time
      c1t0d0s3       0 No     Maintenance             Tue 16 Dec 2014 11:35:15 PM SGT

# metadb -i

    flags        first blk    block count
     a m p luo       16        1034        /dev/dsk/c1t1d0s5
     a    p luo       16        1034        /dev/dsk/c1t1d0s6
     a    p luo       16        1034        /dev/dsk/c1t1d0s7
    M     p            unknown        unknown        /dev/dsk/c1t0d0s6
    M     p            unknown        unknown        /dev/dsk/c1t0d0s7
o - replica active prior to last mddb configuration change
u - replica is up to date
l - locator for this replica was read successfully
c - replica's location was in /etc/lvm/mddb.cf
p - replica's location was patched in kernel
m - replica is master, this is replica selected as input
W - replica has device write errors
a - replica is active, commits are occurring to this replica
M - replica had problem with master blocks
D - replica had problem with data blocks
F - replica had format problems
S - replica is too small to hold current data base
R - replica had device read errors

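In this example, we deduce that disk c1t0d0 has failed: both components in Maintenance state (c1t0d0s0 and c1t0d0s3) and both replicas flagged "M" are on it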
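
On a server with many metadevices, the Maintenance components can be filtered out of the metastat output instead of being read by eye. A minimal sketch, assuming the column layout shown in the example above and a POSIX awk (on Solaris that probably means nawk or /usr/xpg4/bin/awk); failed_components and failed_disks are made-up helper names.

```shell
# failed_components: reads metastat output on stdin and prints every
# slice (cNtNdNsN) whose component line contains the state "Maintenance".
failed_components() {
    awk '$1 ~ /^c[0-9]+(t[0-9]+)?d[0-9]+s[0-9]+$/ {
        for (i = 2; i <= NF; i++)
            if ($i == "Maintenance") { print $1; break }
    }'
}

# failed_disks: same input, but reduced to unique disk names (slice stripped).
failed_disks() {
    failed_components | sed 's/s[0-9]*$//' | sort -u
}

# Intended use on the server:
#   metastat | failed_disks
```

With the example output above this should report c1t0d0 only. It is a reading aid, not a replacement for checking format, iostat -en and /var/adm/messages.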

Identify submirrors to detach

# metastat -p

d0 -m d20 d10 1
d20 1 1 c1t1d0s0
d10 1 1 c1t0d0s0
d3 -m d23 d13 1
d23 1 1 c1t1d0s3
d13 1 1 c1t0d0s3

=> submirrors d10 and d13 are on disk c1t0d0 and need to be detached
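
The mapping from failed disk to submirrors can also be worked out from the metastat -p output. The sketch below only prints the metadetach commands so they can be reviewed before running. It assumes simple one-line entries like the ones above (no continuation lines) and a POSIX awk; detach_plan is a made-up name.

```shell
# detach_plan DISK: reads "metastat -p" output on stdin and prints one
# "metadetach -f MIRROR SUBMIRROR" line per submirror with a slice on DISK.
detach_plan() {
    awk -v disk="$1" '
        # Mirror line, e.g. "d0 -m d20 d10 1": remember the parent mirror.
        $2 == "-m" {
            for (i = 3; i <= NF; i++)
                if ($i ~ /^d[0-9]+$/) parent[$i] = $1
            next
        }
        # Any other line: note the metadevice if it uses a slice of DISK.
        {
            for (i = 2; i <= NF; i++)
                if (index($i, disk "s") == 1 && !($1 in seen)) {
                    seen[$1] = 1
                    order[++n] = $1
                }
        }
        # Print in input order, only for metadevices that belong to a mirror.
        END {
            for (k = 1; k <= n; k++)
                if (order[k] in parent)
                    print "metadetach -f", parent[order[k]], order[k]
        }'
}

# Intended use on the server:
#   metastat -p | detach_plan c1t0d0
```

Printing instead of executing is deliberate: a wrong detach on the surviving side of a mirror is not something to automate blindly.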

Detach

# metadetach -f d0 d10
# metadetach -f d3 d13

The -f option forces the detach. Without it, metadetach is likely to refuse and report an error, because the submirror is in the Needs maintenance state

Run metastat -p again to confirm the submirrors have been detached

# metastat -p

d0 -m d20 1
d20 1 1 c1t1d0s0
d3 -m d23 1
d23 1 1 c1t1d0s3
d10 1 1 c1t0d0s0
d13 1 1 c1t0d0s3

Clear

# metaclear d10

d10: Concat/Stripe is cleared

# metaclear d13

d13: Concat/Stripe is cleared

Run metastat -p again to verify they have been cleared

# metastat -p

d0 -m d20 1
d20 1 1 c1t1d0s0
d3 -m d23 1
d23 1 1 c1t1d0s3

Delete any state database replicas on the failed disk

In the above example, the database replicas on /dev/dsk/c1t0d0s6 and /dev/dsk/c1t0d0s7 have the "M" flag beside them, indicating "replica had problem with master blocks". Delete those and then run metadb -i again to verify they have been removed.

# metadb -d /dev/dsk/c1t0d0s6

# metadb -d /dev/dsk/c1t0d0s7

# metadb -i

    flags        first blk    block count
     a m p luo       16        1034        /dev/dsk/c1t1d0s5
     a    p luo       16        1034        /dev/dsk/c1t1d0s6
     a    p luo       16        1034        /dev/dsk/c1t1d0s7
o - replica active prior to last mddb configuration change
u - replica is up to date
l - locator for this replica was read successfully
c - replica's location was in /etc/lvm/mddb.cf
p - replica's location was patched in kernel
m - replica is master, this is replica selected as input
W - replica has device write errors
a - replica is active, commits are occurring to this replica
M - replica had problem with master blocks
D - replica had problem with data blocks
F - replica had format problems
S - replica is too small to hold current data base
R - replica had device read errors
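
The list of replicas to delete can be pulled from metadb -i by matching the device path in the last column. A rough sketch, assuming a POSIX awk; replicas_on_disk is a made-up name, and it only lists paths rather than deleting anything.

```shell
# replicas_on_disk DISK: reads "metadb -i" output on stdin and prints the
# replica paths that live on DISK (for example c1t0d0).
replicas_on_disk() {
    awk -v disk="$1" '$NF ~ ("^/dev/dsk/" disk "s[0-9]+$") { print $NF }'
}

# Intended use on the server (echo keeps it a dry run):
#   metadb -i | replicas_on_disk c1t0d0 | while read r; do
#       echo metadb -d "$r"
#   done
```

Run before the deletion it should list the two c1t0d0 replicas from the example; run afterwards it should print nothing.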

Remove the failed disk

Use either luxadm or cfgadm (depending on the type of disk) to prepare the disk for removal. Run only the one command below that applies

# /usr/sbin/luxadm remove_device /dev/rdsk/c1t0d0s2

# cfgadm -c unconfigure c1::dsk/c1t0d0

Remove the faulty disk and insert the replacement disk

# devfsadm -v

If cfgadm was used to unconfigure, configure it back

# cfgadm -c configure c1::dsk/c1t0d0

Confirm the new disk is visible in the format listing

# echo | format

Copy the partition table (VTOC) from the unaffected disk to the replacement disk

# prtvtoc /dev/rdsk/c1t1d0s2 | fmthard -s - /dev/rdsk/c1t0d0s2
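
To check that the copy worked, the two partition tables can be compared after saving each prtvtoc output to a file. A small sketch, assuming a POSIX shell (ksh or bash rather than the legacy /bin/sh) and that the header lines of prtvtoc start with "*"; same_vtoc and the file names are made up.

```shell
# same_vtoc FILE_A FILE_B: succeeds if the two saved prtvtoc outputs have
# identical partition lines. Header lines (starting with "*") are ignored
# because they contain the device name and so always differ.
same_vtoc() {
    a=$(grep -v '^\*' "$1")
    b=$(grep -v '^\*' "$2")
    [ "$a" = "$b" ]
}

# Intended use on the server:
#   prtvtoc /dev/rdsk/c1t1d0s2 > /var/tmp/vtoc-good.txt
#   prtvtoc /dev/rdsk/c1t0d0s2 > /var/tmp/vtoc-new.txt
#   same_vtoc /var/tmp/vtoc-good.txt /var/tmp/vtoc-new.txt && echo "VTOC matches"
```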

If the disk is used for booting, install the boot block (the command below is the SPARC form for a UFS root)

# installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c1t0d0s0

Create state database replicas on the new disk. In our example there was one replica each on c1t0d0s5, c1t0d0s6 and c1t0d0s7 before the disk failed

# metadb -af /dev/dsk/c1t0d0s5

# metadb -af /dev/dsk/c1t0d0s6

# metadb -af /dev/dsk/c1t0d0s7

Verify the replicas on the replacement disk are healthy

# metadb -i

Initialise the new submirrors on the replacement disk

# metainit -f d10 1 1 c1t0d0s0

# metainit -f d13 1 1 c1t0d0s3

Attach them to the mirrors

# metattach d0 d10

# metattach d3 d13

Confirm the mirrors are starting to sync

# metastat
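
While the resync runs, metastat normally reports a percentage per mirror. The sketch below pulls that out. It assumes the line looks like "Resync in progress: 12 % done", which is not shown in the example output above, so check it against the real output first; resync_progress is a made-up name.

```shell
# resync_progress: reads metastat output on stdin and prints
# "MIRROR PERCENT%" for every mirror that reports a resync in progress.
resync_progress() {
    awk '
        # Remember the most recent mirror header, e.g. "d0: Mirror".
        /^d[0-9]+: Mirror/ { name = $1; sub(/:$/, "", name) }
        # Assumed format: "Resync in progress: 12 % done".
        /Resync in progress:/ { print name, $4 "%" }
    '
}

# Intended use on the server, polling once a minute until done:
#   while metastat | grep "Resync in progress" >/dev/null; do
#       metastat | resync_progress
#       sleep 60
#   done
```

Once the loop ends, run metastat again and confirm that all submirrors show State: Okay.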