Thursday, July 07, 2005

Plex Problems and Solutions

Topics:
Displaying State Information for VxVM Objects
Interpreting Plex States
Interpreting Volume States
Interpreting Kernel States
Resolving Plex Problems
Analyzing Plex Problems


Identifying Plex Problems

To identify and solve plex problems, use the following information:
- Plex states
- Volume states
- Plex kernel states
- Volume kernel states
- Object condition flags

Commands to display plex, volume, and kernel states:
vxprint -g diskgroup -ht [volume_name]
vxinfo -p -g diskgroup [volume_name]
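
As a rough sketch, the vxprint output can be filtered so that only the records needing attention remain. The function name list_problem_states is made up for this example, and it assumes that KSTATE is the fourth field and STATE the fifth field of the v (volume) and pl (plex) records in "vxprint -ht" output; check the header lines on your release before relying on it.

```shell
# list_problem_states: read "vxprint -g diskgroup -ht" output on stdin and
# print every volume (v) or plex (pl) record that is not ENABLED/ACTIVE.
# Assumption: field 4 is KSTATE and field 5 is STATE for v and pl records.
# Header lines start with uppercase "V" or "PL", so they are skipped.
list_problem_states() {
    awk '($1 == "v" || $1 == "pl") && !($4 == "ENABLED" && $5 == "ACTIVE") {
        printf "%s %s kstate=%s state=%s\n", $1, $2, $4, $5
    }'
}

# Usage on a host with VxVM installed (mydg is a placeholder disk group):
#   vxprint -g mydg -ht | list_problem_states
```

An empty result means every volume and plex in the listing is ENABLED and ACTIVE.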


Plex States and Condition Flags

EMPTY: indicates that you have not yet defined which plex has the good data (CLEAN), and which plex does not have the good data (STALE).

CLEAN: is normal and indicates that the plex has a copy of the data that represents the volume. CLEAN also means that the volume is not started and, by the administrator’s choice, cannot currently handle I/O.

ACTIVE: is the same as CLEAN, but the volume is or was started, and the volume is or was able to perform I/O.

SNAPDONE: is the same as ACTIVE or CLEAN, but is a plex that has been synchronized with the volume as a result of a “vxassist snapstart” operation. After a reboot or a manual start of the volume, a plex in the SNAPDONE state is removed along with its subdisks.

STALE: indicates that VxVM has reason to believe that the data in the plex is not synchronized with the data in the CLEAN plexes. This state is usually caused by taking the plex offline or by a disk failure.

SNAPATT: indicates that the object is a snapshot that is currently being synchronized but does not yet have a complete copy of the data.

OFFLINE: indicates that the administrator has issued the “vxmend off” command on the plex. When the admin brings the plex back online using the “vxmend on” command, the plex changes to the STALE state.

TEMP: the TEMP state flags (TEMP, TEMPRM, TEMPRMSD) usually indicate that the data was never a copy of the volume’s data, and you should not use these plexes. These temporary states indicate that the plex is currently involved in a synchronization operation with the volume.

NODEVICE: indicates that the disk drive below the plex has failed.

REMOVED: has the same meaning as NODEVICE, but the system admin has requested that the device appear as failed.

IOFAIL: is similar to NODEVICE, but it indicates that an unrecoverable failure occurred on the device, and VxVM has not yet verified whether the disk is actually bad. Note: I/O to both the public and the private regions must fail to change the state from IOFAIL to NODEVICE.

RECOVER: is set on a plex when two conditions are met:
1) A failed disk has been fixed (by using vxreattach or the vxdiskadm option, “Replace a failed or removed disk”).
2) The plex was in the ACTIVE state prior to the failure.


Volume States

EMPTY, CLEAN, and ACTIVE: have the same meanings as they do for plexes.

NEEDSYNC: is the same as SYNC, but the internal read thread has not been started. This state exists so that volumes that use the same disk are not synchronized at the same time, and head thrashing is avoided.

SYNC: indicates that the plexes are involved in read-writeback or RAID-5 parity synchronization:

- Each time that a read occurs from a plex, the data that was read is written back to all the other plexes that are in the ACTIVE state.

- An internal read thread is started to read the entire volume (or, after a system crash, only the dirty regions if dirty region logging (DRL) is being used), forcing the data to be synchronized completely. On a RAID-5 volume, the presence of a RAID-5 log speeds up a SYNC operation.

NODEVICE: indicates that none of the plexes have currently accessible disk devices underneath the volume.


Kernel States
Kernel states represent VxVM’s ability to transfer I/O to the volume or plex.

ENABLED: The object can transfer both system I/O and user I/O
DETACHED: The object can transfer system I/O, but not user I/O (maintenance mode)
DISABLED: No I/O can be transferred.
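
The three kernel states can be written down as a small lookup, which is handy when scripting checks. This is only an illustration of the table above; the function name kstate_io is invented for the example and is not part of VxVM.

```shell
# kstate_io KSTATE
# Print which kinds of I/O an object in the given kernel state can transfer.
kstate_io() {
    case "$1" in
        ENABLED)  echo "system and user I/O" ;;
        DETACHED) echo "system I/O only" ;;      # maintenance mode
        DISABLED) echo "no I/O" ;;
        *)        echo "unknown kernel state: $1" >&2; return 1 ;;
    esac
}

# Example: kstate_io DETACHED
```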


Solving Plex Problems

Commands used to fix plex problems:
vxrecover
vxvol init
vxvol -f start
vxmend fix
vxmend off|on


The vxrecover Command

vxrecover -g diskgroup -s [volume_name]
- Recovers and resynchronizes all plexes in a started volume.
- Runs “vxvol start” and “vxplex att” commands (and sometimes “vxvol resync”)
- Works in normal situations
- Resynchronizes all volumes that need recovery if a volume name is not included.


Initializing a Volume’s Plexes

vxvol -g diskgroup init init_type volume_name [plexes]

init_type:
zero: sets all plexes to a value of 0, which means that all bytes are null
active: sets all plexes to active and enables the volume and its plexes
clean: If you know that one of the plexes has the correct data, you can select that particular plex to represent the data of the volume. In this case, all other plexes will copy their content from the clean plex when the volume is started.
enable: use this option to temporarily enable the volume so that data can be loaded onto it to make the plexes consistent.


The “vxvol start” Command

vxvol -g diskgroup -f start volume_name

- This command ignores problems with the volume and starts the volume
- Only use this command on nonredundant volumes. If it is used on redundant volumes, data can be corrupted unless all mirrors have the same data.


The vxmend Command

vxmend -g diskgroup fix stale|clean|active|empty plex


vxmend fix stale

vxmend -g diskgroup fix stale plex
- This command changes a CLEAN or ACTIVE (RECOVER) state to STALE
- The volume that the plex is associated with must be in DISABLED mode.
- Use this command as an intermediate step to the final destination for the plex state.


vxmend fix clean

vxmend -g diskgroup fix clean plex
- This command changes a STALE plex to CLEAN
- Only run this command if:
1) The associated volume is in the DISABLED state.
2) No other plex is in the CLEAN state.
3) All of the plexes are in the STALE or OFFLINE states.
- After you change the state of a plex to CLEAN, recover the volume by using:
vxrecover -s
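
The conditions for “vxmend fix clean” can be checked mechanically before running the command. The sketch below assumes you have already read the volume’s kernel state and the state of every plex (for example from vxprint output) and pass them in as arguments; the function name can_fix_clean is hypothetical. Note that condition 3 already implies condition 2, so a single loop covers both.

```shell
# can_fix_clean VOLUME_KSTATE PLEX_STATE...
# Return 0 only if the volume is DISABLED and every plex of the volume is
# STALE or OFFLINE (which also means that no plex is CLEAN).
can_fix_clean() {
    vol_kstate=$1
    shift
    [ "$vol_kstate" = "DISABLED" ] || return 1
    [ "$#" -gt 0 ] || return 1          # at least one plex state is required
    for state in "$@"; do
        case "$state" in
            STALE|OFFLINE) ;;            # allowed
            *) return 1 ;;               # CLEAN, ACTIVE, or any other state
        esac
    done
    return 0
}

# Example: can_fix_clean DISABLED STALE OFFLINE && echo "ok to run vxmend fix clean"
```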


vxmend fix active

vxmend -g diskgroup fix active plex
- This command changes a STALE plex to ACTIVE
- The volume that the plex is associated with must be in DISABLED mode
- When you run “vxvol start”:
ACTIVE plexes are synchronized (SYNC) together
RECOVER plexes are set to STALE and are synchronized from the ACTIVE plexes.


vxmend fix empty

vxmend -g diskgroup fix empty volume_name
- Sets all plexes and the volume to the EMPTY state
- Requires the volume to be in DISABLED mode
- Runs on the volume, not on a plex
- Returns the volume to the same state that it has after bottom-up creation


vxmend off|on
When analyzing plexes, you can temporarily take plexes offline while validating the data in another plex.
- To take a plex offline, use the command:
vxmend -g diskgroup off plex
- To take the plex out of the offline state, use:
vxmend -g diskgroup on plex


Fixing Layered Volumes
- For layered volumes, vxmend functions the same as with nonlayered volumes.
- When starting the volume, use either:
1) “vxrecover -s”: starts both the top-level volume and the subvolumes
2) “vxvol start”: with VxVM 4.0 and later, “vxvol start” completely starts layered volumes (and “vxvol stop” completely stops them).


Example: If the Good Plex Is Known
- For plex vol01-01, the disk was turned off and back on and still has data.
- Plex vol01-02 has been offline for several hours.

To recover:
1) Set all plexes to STALE (vxmend fix stale vol01-01)
2) Set the good plex to CLEAN (vxmend fix clean vol01-01)
3) Run “vxrecover -s vol01”
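
The three steps above can be sketched as a small helper that prints the commands rather than running them, so that you can review the plan first. The function name plan_known_good and the disk group name mydg are placeholders. The example in this post omits the disk group, so adding “-g diskgroup” here is an assumption based on the syntax lines shown earlier.

```shell
# plan_known_good DISKGROUP VOLUME GOOD_PLEX
# Print (do not run) the recovery commands for a volume whose good plex
# is known. The volume is assumed to be in DISABLED mode already.
plan_known_good() {
    dg=$1
    vol=$2
    good=$3
    echo "vxmend -g $dg fix stale $good"
    echo "vxmend -g $dg fix clean $good"
    echo "vxrecover -g $dg -s $vol"
}

# Example: plan_known_good mydg vol01 vol01-01
```

Once the printed commands look right, they can be run by hand one at a time, checking the plex states with vxprint between steps.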


Example: If the Good Plex Is Not Known
The volume is disabled and not startable, and you do not know what happened. There are no CLEAN plexes.

To resolve:
1) Take all but one plex offline and set that plex to CLEAN (vxmend off vol01-02; vxmend fix clean vol01-01)
2) Run “vxrecover -s”
3) Verify data on the volume
4) Run “vxvol stop”
5) Repeat for each plex until you identify the plex with the good data
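
One pass of this procedure can be sketched as a helper that prints the commands for a single candidate plex. As before, nothing is executed; the function name plan_try_plex and the disk group mydg are placeholders, and the “-g diskgroup” arguments are an assumption carried over from the syntax lines earlier in this post. The steps above do not spell out how the plex states are reset between attempts, so the sketch leaves that part out.

```shell
# plan_try_plex DISKGROUP VOLUME CANDIDATE_PLEX [OTHER_PLEX...]
# Print (do not run) the commands for testing one candidate plex:
# take the other plexes offline, mark the candidate CLEAN, start the
# volume, then stop it again after the data has been checked by hand.
plan_try_plex() {
    dg=$1
    vol=$2
    candidate=$3
    shift 3
    for other in "$@"; do
        echo "vxmend -g $dg off $other"
    done
    echo "vxmend -g $dg fix clean $candidate"
    echo "vxrecover -g $dg -s $vol"
    echo "# verify the data on $vol before continuing"
    echo "vxvol -g $dg stop $vol"
}

# Example: plan_try_plex mydg vol01 vol01-01 vol01-02
```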