Exadata (Half Rack) Image Upgrade (Rolling) Part 1: Cell Server

As a DMA, patching the Exadata machine is a regular activity. There are two ways of patching an Exadata box: Rolling and Non-Rolling. In this blog, we will start with Part 1 of the Exadata (Half Rack) Image Upgrade (Rolling), covering the cell servers.

Step 1: Backup Current Configurations

echo "Executing prechecks specific to cell nodes..."

This is a one-time precheck; the configuration is collected for all 7 cell nodes listed in the cell_group file.
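Before running the commands below, the output directory and the cell_group file (one cell hostname per line) need to exist on the driving node. A minimal setup sketch, assuming the paths used in this post:

mkdir -p /u01/exa_img_upg/prechecks
# cell_group should list all 7 cell hostnames, one per line
cat cell_group
# Quick sanity check that root ssh equivalence to all cells works
dcli -l root -g cell_group hostname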

#cd /root

echo ""  >>/u01/exa_img_upg/prechecks/cell_node_prechecks.txt

echo "************************" >>/u01/exa_img_upg/prechecks/cell_node_prechecks.txt

echo "Cell specific prechecks" >>/u01/exa_img_upg/prechecks/cell_node_prechecks.txt

echo "************************" >>/u01/exa_img_upg/prechecks/cell_node_prechecks.txt

echo -e "\n# Need at least 1.5gb + size of ISO file (approx 3gb total for Jan-July releases) space on / partition of cells to do cell update"  >>/u01/exa_img_upg/prechecks/cell_node_prechecks.txt

dcli -l root -g cell_group df -h / >>/u01/exa_img_upg/prechecks/cell_node_prechecks.txt

echo -e "\n# Check all cells are up"  >>/u01/exa_img_upg/prechecks/cell_node_prechecks.txt

dcli -l root -g cell_group cellcli -e list cell >>/u01/exa_img_upg/prechecks/cell_node_prechecks.txt

echo -e "\n# Check cell network configuration"  >>/u01/exa_img_upg/prechecks/cell_node_prechecks.txt

dcli -l root -g cell_group "/opt/oracle.cellos/ipconf -verify" >>/u01/exa_img_upg/prechecks/cell_node_prechecks.txt

echo -e "\n# Validate cell disks for valid physicalInsertTime (should be no output)"  >>/u01/exa_img_upg/prechecks/cell_node_prechecks.txt

dcli -l root -g cell_group cellcli -e 'list physicaldisk attributes luns where physicalInsertTime = null' >>/u01/exa_img_upg/prechecks/cell_node_prechecks.txt

echo -e "\n# Check for WriteBack Flash Cache"  >>/u01/exa_img_upg/prechecks/cell_node_prechecks.txt

dcli -l root -g cell_group cellcli -e "list cell attributes flashcachemode" >>/u01/exa_img_upg/prechecks/cell_node_prechecks.txt

echo -e "\n# Check for Flash Cache Compression"  >>/u01/exa_img_upg/prechecks/cell_node_prechecks.txt

dcli -l root -g cell_group cellcli -e "list cell attributes flashCacheCompress" >>/u01/exa_img_upg/prechecks/cell_node_prechecks.txt


echo "Executing prechecks specific to compute nodes..."

This is a one-time precheck; the configuration is collected for all 4 database nodes listed in the dbs_group file.

#cd /root

echo ""  >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

echo "************************" >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

echo "Compute specific prechecks" >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

echo "************************" >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

echo -e "\n# need 3-5 gb on / partition"  >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

dcli -l root -g dbs_group df -h /  >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

echo -e "\n# need ~40 mb on /boot partition"  >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

dcli -l root -g dbs_group df -h /boot  >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

echo -e "\n# check freespace on /u01 partition"  >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

dcli -l root -g dbs_group df -h /u01  >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

echo -e "\n# Make sure not snaps are still active"  >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

dcli -l root -g dbs_group lvs >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

echo -e "\n# Need at least 1.5gb gb free PE"  >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

dcli -l root -g dbs_group vgdisplay >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt


echo -e "\n# Mounted filesystems"  >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

dcli -l root -g dbs_group mount  >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

echo -e "\n# Contents of /etc/fstab"  >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

dcli -l root -g dbs_group cat /etc/fstab  >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

echo -e "\n# Contents of /etc/exports"  >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

dcli -l root -g dbs_group cat /etc/exports  >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

#

echo "Done."

echo -e "\n# Contents of /etc/exports"  >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

dcli -l root -g dbs_group cat /etc/exports  >>/u01/exa_img_upg/prechecks/compute_node_prechecks.txt

#

echo "Done."

Step 2: Pre-Validation - Check whether any critical, stateful alerts are left over on the cell servers

cd /root


[root@abcxyzadm01 ~]# dcli -g  cell_group -l root "cellcli -e list alerthistory attributes name,beginTime,alertShortName,alertDescription,severity where alerttype=stateful and severity=critical"

We can drop the entire alert history:
CellCLI> drop alerthistory all

OR
====
To drop an individual alert, use the commands below.
Example :
CellCLI> list alerthistory <6_1>
CellCLI> list alerthistory <6_1>  detail

CellCLI> drop alerthistory <6_1>

Only drop alerts that have already been taken care of.
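To clear reviewed alerts on every cell in one pass, the drop can also be driven through dcli from the compute node. A sketch, to be run only after all alerts have been reviewed:

dcli -g cell_group -l root "cellcli -e drop alerthistory all"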

Step 3: Precheck

[root@abcxyzadm01 patch_PATCHNUMBER]# ./patchmgr -cells <cell_group> -patch_check_prereq -rolling

-rw-r--r-- 1 root   root             10 May  4 15:53 cell_node_1
-rw-r--r-- 1 root   root             11 May  4 16:10 cell_node_2
-rw-r--r-- 1 root   root             10 May  4 16:11 cell_node_3
-rw-r--r-- 1 root   root             10 May  4 16:11 cell_node_4
-rw-r--r-- 1 root   root             11 May  4 16:11 cell_node_5
-rw-r--r-- 1 root   root             10 May  4 16:11 cell_node_6
-rw-r--r-- 1 root   root             10 May  4 16:12 cell_node_7
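Each cell_node_N file above holds a single cell hostname, so patchmgr can be pointed at one cell at a time. If they do not exist yet, they can be generated from the main cell_group file with a small loop (a sketch, assuming cell_group lists one hostname per line):

i=1
while read cell; do
  echo "$cell" > cell_node_$i
  i=$((i+1))
done < cell_group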
Note: Change the <cell_group> name to the one for the cell you are patching.
Example: For the first cell node:
[root@abcxyzadm01 patch_PATCHNUMBER]# ./patchmgr -cells cell_node_1 -patch_check_prereq -rolling
2017-05-08 11:24:52 -0700 [INFO] Disabling /var/log/cellos cleanup on this node for the duration of the patchmgr session.

2017-05-08 11:25:56 -0700        :Working: DO: Check cells have ssh equivalence for root user. Up to 10 seconds per cell ...
2017-05-08 11:25:56 -0700        :SUCCESS: DONE: Check cells have ssh equivalence for root user.
2017-05-08 11:26:00 -0700        :Working: DO: Initialize files, check space and state of cell services. Up to 1 minute ...
2017-05-08 11:26:44 -0700        :SUCCESS: DONE: Initialize files, check space and state of cell services.
2017-05-08 11:26:44 -0700        :Working: DO: Copy, extract prerequisite check archive to cells. If required start md11 mismatched partner size correction. Up to 40 minutes ...
2017-05-08 11:26:59 -0700 Wait correction of degraded md11 due to md partner size mismatch. Up to 30 minutes.


2017-05-08 11:27:00 -0700        :SUCCESS: DONE: Copy, extract prerequisite check archive to cells. If required start md11 mismatched partner size correction.
2017-05-08 11:27:00 -0700        :Working: DO: Check prerequisites on all cells. Up to 2 minutes ...
2017-05-08 11:27:39 -0700        :SUCCESS: DONE: Check prerequisites on all cells.
2017-05-08 11:27:39 -0700        :Working: DO: Execute plugin check for Patch Check Prereq ...
2017-05-08 11:27:39 -0700 :INFO: Patchmgr plugin start: Prereq check for exposure to bug 22909764 v1.0. Details in logfile /u01/exa_img_upg/CELL/patch_PATCHNUMBER/patchmgr.stdout.
2017-05-08 11:27:39 -0700 :INFO: Patchmgr plugin start: Prereq check for exposure to bug 17854520 v1.3. Details in logfile /u01/exa_img_upg/CELL/patch_PATCHNUMBER/patchmgr.stdout.
2017-05-08 11:27:41 -0700 :WARNING: ACTION REQUIRED: Cells to be upgraded pass version check, however other cells not being upgraded may be at version 11.2.3.1.x or 11.2.3.2.x, exposing the system to bug 17854520.  Manually check other cells for version 11.2.3.1.x or 11.2.3.2.x.
2017-05-08 11:27:41 -0700 :INFO: Checking database homes for remote db nodes with oracle-user ssh equivalence to the local system.
2017-05-08 11:27:41 -0700 :INFO: Database homes that exist only on remote nodes must be checked manually.
2017-05-08 11:27:47 -0700 :SUCCESS: Patchmgr plugin complete: Prereq check passed - no exposure to bug 17854520
2017-05-08 11:27:47 -0700 :INFO: Patchmgr plugin start: Prereq check for exposure to bug 22468216 v1.0. Details in logfile /u01/exa_img_upg/CELL/patch_PATCHNUMBER/patchmgr.stdout.
2017-05-08 11:27:48 -0700 :SUCCESS: Patchmgr plugin complete: Prereq check passed for the bug 22468216
2017-05-08 11:27:48 -0700        :INFO   : Patchmgr plugin start: Prereq check for exposure to bug 24625612 v1.0.
2017-05-08 11:27:48 -0700        :INFO   : Details in logfile /u01/exa_img_upg/CELL/patch_PATCHNUMBER/patchmgr.stdout.
2017-05-08 11:27:48 -0700        :SUCCESS: Patchmgr plugin complete: Prereq check passed for the bug 24625612
2017-05-08 11:27:48 -0700        :SUCCESS: DONE: Execute plugin check for Patch Check Prereq.

Restart ILOM on all cell nodes (optional)
dcli -l root -g <cell_group> "ipmitool bmc reset cold"
Note: Change <cell_group> to the group file for the cell server you are patching, as below:
cell_node_1
cell_node_2
cell_node_3
cell_node_4
cell_node_5
cell_node_6
cell_node_7
Example
dcli -l root -g cell_node_1 "ipmitool bmc reset cold"
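The BMC reset takes a few minutes to complete. Before proceeding, you can confirm the ILOM is responding again; a sketch using the standard ipmitool mc info query:

sleep 300   # give the BMC time to come back
dcli -l root -g cell_node_1 "ipmitool mc info"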


Check the repair times for all mounted disk groups in the Oracle ASM instance and adjust if needed. Note: Set disk_repair_time to 8.5 hours.


sqlplus / as sysasm
select dg.name,a.value from v$asm_diskgroup dg, v$asm_attribute a where dg.group_number=a.group_number and a.name='disk_repair_time';
NAME         VALUE
------------ ----------
DATA_DG      3.6h
DBFS_DG      3.6h
RECO_DG      3.6h
If the repair time is not 8.5 hours, note the current value and the diskgroup names, then replace <diskgroup_name> in the following statement to adjust:
alter diskgroup <diskgroup_name> set attribute 'disk_repair_time'='8.5h';
Repeat the above statement for each diskgroup as below
alter diskgroup DATA_DG set attribute 'disk_repair_time'='8.5h';
alter diskgroup DBFS_DG set attribute 'disk_repair_time'='8.5h';
alter diskgroup RECO_DG set attribute 'disk_repair_time'='8.5h';


Also increase the rebalance power to 5

alter diskgroup RECO_DG rebalance power 5;
alter diskgroup DATA_DG rebalance power 5;
alter diskgroup DBFS_DG rebalance power 5;
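Both changes can also be applied and re-verified from a single sqlplus session on the ASM instance. A sketch, assuming the environment is set for the local ASM instance and the diskgroup names from this example:

sqlplus -S / as sysasm <<'EOF'
alter diskgroup DATA_DG set attribute 'disk_repair_time'='8.5h';
alter diskgroup DBFS_DG set attribute 'disk_repair_time'='8.5h';
alter diskgroup RECO_DG set attribute 'disk_repair_time'='8.5h';
alter diskgroup DATA_DG rebalance power 5;
alter diskgroup DBFS_DG rebalance power 5;
alter diskgroup RECO_DG rebalance power 5;
select dg.name, a.value from v$asm_diskgroup dg, v$asm_attribute a
 where dg.group_number = a.group_number and a.name = 'disk_repair_time';
EOF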

Check uptime and reboot if needed
                                                         
cd /u01/exa_img_upg/CELL/patch_PATCHNUMBER
dcli -l root -g cell_group "uptime"

[root@abcxyzadm01 ~]# dcli -l root -g cell_group "uptime"
xyzceladm01: 12:51:44 up 222 days, 17:17,  0 users,  load average: 1.45, 1.33, 1.35
xyzceladm02: 12:51:44 up 222 days, 17:17,  0 users,  load average: 1.64, 1.21, 1.25
xyzceladm03: 12:51:44 up 222 days, 17:18,  0 users,  load average: 1.34, 1.45, 1.43
xyzceladm04: 12:51:44 up 222 days, 17:17,  0 users,  load average: 0.98, 1.25, 1.35
xyzceladm05: 12:51:44 up 222 days, 17:17,  0 users,  load average: 1.10, 1.33, 1.42
xyzceladm06: 12:51:44 up 222 days, 17:17,  0 users,  load average: 0.88, 1.21, 1.34
xyzceladm07: 12:51:44 up 222 days, 17:17,  0 users,  load average: 0.99, 1.21, 1.30
[root@abcxyzadm01 ~]#
Note: If the cells have been up for more than 7 days, reboot each cell in a rolling fashion using the note below.
Steps to shut down or reboot an Exadata storage cell without affecting ASM (Doc ID 1188080.1)
Note: In our case we need to reboot the cell server
cd /u01/exa_img_upg/CELL/patch_PATCHNUMBER
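For reference, the rolling reboot from that note boils down to inactivating the grid disks on one cell, rebooting it, and reactivating the disks. A sketch of the commands involved (run on one cell at a time; confirm asmdeactivationoutcome is 'Yes' for all disks before taking them offline):

# On the cell to be rebooted: verify the disks can be taken offline safely
cellcli -e list griddisk attributes name,asmdeactivationoutcome
# Inactivate all grid disks, then reboot
cellcli -e alter griddisk all inactive
shutdown -r now
# After the cell is back up, reactivate and wait for ONLINE
cellcli -e alter griddisk all active
cellcli -e list griddisk attributes name,asmmodestatus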


Step 4: Cleanup space from any previous runs

The -reset_force command is only used the first time the cells are patched to this release.
It is not necessary for subsequent cell patching, even after rolling back the patch.
In our case we need the -cleanup option, not the -reset_force option.
[root@abcxyzadm05 ~]#cd /u01/exa_img_upg/CELL/patch_PATCHNUMBER

 ./patchmgr -cells <cell_group> -reset_force

Note : Always use the -cleanup option before retrying a failed or halted run of the patchmgr utility.
[root@abcxyzadm05 ~]#cd /u01/exa_img_upg/CELL/patch_PATCHNUMBER
  ./patchmgr -cells <cell_group> -cleanup
Note: Change <cell_group> to the group file for the cell you are patching, as below:
cell_node_1
cell_node_2
cell_node_3
cell_node_4
cell_node_5
cell_node_6
cell_node_7
Example : ./patchmgr -cells cell_node_1 -cleanup

Step 5: Run prerequisites check
=======================
 #cd /u01/exa_img_upg/CELL/patch_PATCHNUMBER
 ./patchmgr -cells <cell_group> -patch_check_prereq -rolling

Note: Change <cell_group> to the group file for the cell you are patching, as below:

cell_node_1
cell_node_2
cell_node_3
cell_node_4
cell_node_5
cell_node_6
cell_node_7
Example : ./patchmgr -cells cell_node_1 -patch_check_prereq -rolling

Step 6: Patch the cell nodes

 #cd /u01/exa_img_upg/CELL/patch_PATCHNUMBER
dcli -l root -g <cell_group> imageinfo

 nohup ./patchmgr -cells <cell_group> -patch -rolling &

Step 7: Check the Progress of the Image Upgrade

Monitor the patch progress.
Monitor the ILOM console for each cell being patched. You may want to download the ilom-login.sh script from note 1616791.1 to assist in logging into the ILOMs.
 cd /u01/exa_img_upg/CELL/patch_PATCHNUMBER
 tail -f nohup.out
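Besides tailing nohup.out, the image status on the cell being patched can be polled from the driving node. A simple sketch (cell_node_1 as the example group):

while true; do
  dcli -l root -g cell_node_1 "imageinfo -status; uptime"
  sleep 300
done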

Step 8: Post Patch Space Cleanup

Cleanup space
./patchmgr -cells <cell_group> -cleanup

Step 9: Post Image Upgrade Validations

Post checks:
 dcli -l root -g <cell_group> imageinfo -version
 dcli -l root -g <cell_group> imageinfo -status
 dcli -l root -g <cell_group> "uname -r"
 dcli -l root -g <cell_group> cellcli -e list cell
 dcli -l root -g <cell_group> /opt/oracle.cellos/CheckHWnFWProfile
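Once all cells have been patched, a quick way to confirm they ended up on the same image version is to de-duplicate the dcli output. A sketch, assuming dcli's usual 'hostname: value' output format:

dcli -l root -g cell_group imageinfo -version | awk '{print $2}' | sort -u
# A single line of output means every cell reports the same version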
Also, as part of the post checks, verify the grid disk status:
(a) Verify all grid disks have been successfully put online using the following command:
dcli -l root -g cell_group cellcli -e  list griddisk attributes name, asmmodestatus
[root@abcxyzadm01 patch_PATCHNUMBER]# dcli -l root -g <cell_group> cellcli -e  list griddisk attributes name, asmmodestatus
xyzcel07: DATA_DG_CD_00_xyzcel07    ONLINE
xyzcel07: DATA_DG_CD_01_xyzcel07    ONLINE
xyzcel07: DATA_DG_CD_02_xyzcel07    ONLINE
xyzcel07: DATA_DG_CD_03_xyzcel07    ONLINE
xyzcel07: DATA_DG_CD_04_xyzcel07    ONLINE
xyzcel07: DATA_DG_CD_05_xyzcel07    ONLINE
xyzcel07: DATA_DG_CD_06_xyzcel07    ONLINE
xyzcel07: DATA_DG_CD_07_xyzcel07    ONLINE
xyzcel07: DATA_DG_CD_08_xyzcel07    ONLINE
xyzcel07: DATA_DG_CD_09_xyzcel07    ONLINE
xyzcel07: DATA_DG_CD_10_xyzcel07    ONLINE
xyzcel07: DATA_DG_CD_11_xyzcel07    ONLINE
xyzcel07: DBFS_DG_CD_02_xyzcel07       ONLINE
xyzcel07: DBFS_DG_CD_03_xyzcel07       ONLINE
xyzcel07: DBFS_DG_CD_04_xyzcel07       ONLINE
xyzcel07: DBFS_DG_CD_05_xyzcel07       ONLINE
xyzcel07: DBFS_DG_CD_06_xyzcel07       ONLINE
xyzcel07: DBFS_DG_CD_07_xyzcel07       ONLINE
xyzcel07: DBFS_DG_CD_08_xyzcel07       ONLINE
xyzcel07: DBFS_DG_CD_09_xyzcel07       ONLINE
xyzcel07: DBFS_DG_CD_10_xyzcel07       ONLINE
xyzcel07: DBFS_DG_CD_11_xyzcel07       ONLINE
xyzcel07: RECO_DG_CD_00_xyzcel07    ONLINE
xyzcel07: RECO_DG_CD_01_xyzcel07    ONLINE
xyzcel07: RECO_DG_CD_02_xyzcel07    ONLINE
xyzcel07: RECO_DG_CD_03_xyzcel07    ONLINE
xyzcel07: RECO_DG_CD_04_xyzcel07    ONLINE
xyzcel07: RECO_DG_CD_05_xyzcel07    ONLINE
xyzcel07: RECO_DG_CD_06_xyzcel07    ONLINE
xyzcel07: RECO_DG_CD_07_xyzcel07    ONLINE
xyzcel07: RECO_DG_CD_08_xyzcel07    ONLINE
xyzcel07: RECO_DG_CD_09_xyzcel07    ONLINE
xyzcel07: RECO_DG_CD_10_xyzcel07    ONLINE
xyzcel07: RECO_DG_CD_11_xyzcel07    ONLINE
[root@abcxyzadm01 patch_PATCHNUMBER]#


(b) Wait until asmmodestatus is ONLINE for all grid disks. Each disk will go to a 'SYNCING' state first then 'ONLINE'. The following is an example of the output:
DATA_CD_00_xyzcel01 ONLINE
DATA_CD_01_xyzcel01 SYNCING
DATA_CD_02_xyzcel01 OFFLINE
DATA_CD_03_xyzcel01 OFFLINE
DATA_CD_04_xyzcel01 OFFLINE
DATA_CD_05_xyzcel01 OFFLINE
DATA_CD_06_xyzcel01 OFFLINE
DATA_CD_07_xyzcel01 OFFLINE
DATA_CD_08_xyzcel01 OFFLINE
DATA_CD_09_xyzcel01 OFFLINE
DATA_CD_10_xyzcel01 OFFLINE
DATA_CD_11_xyzcel01 OFFLINE
(c) Oracle ASM synchronization is only complete when all grid disks show asmmodestatus=ONLINE.
(Please note: this operation uses the Fast Mirror Resync operation, which does not trigger an ASM rebalance. The resync restores only the extents that were written while the disk was offline.)
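This wait can be scripted from the driving node. A sketch that polls until no grid disk on the patched cell reports anything other than ONLINE:

while dcli -l root -g cell_node_1 "cellcli -e list griddisk attributes name,asmmodestatus" | grep -qv ONLINE
do
  sleep 60
done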
Note: In the above command, change <cell_group> to the group file for the cell server you are patching, as below:
cell_node_1
cell_node_2
cell_node_3
cell_node_4
cell_node_5
cell_node_6
cell_node_7

Step 10: Post-Execution Steps After Validating the Image Upgrade


1. Change disk_repair_time back to the original value.


sqlplus / as sysasm
select dg.name,a.value from v$asm_diskgroup dg, v$asm_attribute a where dg.group_number=a.group_number and a.name='disk_repair_time';
NAME         VALUE
------------ ----------
DATA_DG      8.5h
DBFS_DG      8.5h
RECO_DG      8.5h
Reset the repair time to the original value if it was changed at the start of patching. Replace <diskgroup_name> and the value in the following statement to adjust:
alter diskgroup <diskgroup_name> set attribute 'disk_repair_time'='3.6h';
Repeat the above statement for each diskgroup as below
alter diskgroup DATA_DG set attribute 'disk_repair_time'='3.6h';
alter diskgroup DBFS_DG set attribute 'disk_repair_time'='3.6h';
alter diskgroup RECO_DG set attribute 'disk_repair_time'='3.6h';


Also set the rebalance power back to 2:

alter diskgroup RECO_DG rebalance power 2;
alter diskgroup DATA_DG rebalance power 2;
alter diskgroup DBFS_DG rebalance power 2;

Step 11: Known Issues (In Case the Image Upgrade Fails)

 Additional checks (if there were problems)

 cd /u01/exa_img_upg/CELL/patch_PATCHNUMBER

 cat  patchmgr.stdout

 cat the _wip_stdout file for the session

 ssh <cell-node>

 cd /var/log/cellos

 grep -i 'fail' validations.log

 grep -i 'fail' vldrun*.log

 cat validations.log

 cat vldrun.upgrade_reimage_boot.log

 cat vldrun.first_upgrade_boot.log

 cat CheckHWnFWProfile.log

 cat cell.bin.install.log

 cat cellFirstboot.log

 cat exachkcfg.log

 cat patch.out.place.sh.log

 cat install.sh.log
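These checks can be batched from the driving node. A sketch that lists, per cell, which of the common cellos logs mention a failure:

for c in $(cat cell_group); do
  echo "== $c =="
  ssh "$c" "grep -il fail /var/log/cellos/validations.log /var/log/cellos/vldrun*.log 2>/dev/null"
done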

---------------------------------------------------------------------------------------------------
 Rolling Back Successfully Patched Exadata Cells
 (This section describes how to roll back successfully patched Exadata Cells. Cells with incomplete or failed patching cannot be rolled back.)
 
 Do not run more than one instance of the patchmgr utility at a time in the deployment.
 
 Check the prerequisites using the following command:
 
 ./patchmgr -cells cell_group -rollback_check_prereq [-rolling]
 
 Perform the rollback using the following command:
 
 ./patchmgr -cells cell_group -rollback [-rolling]

See the next part of this series for the Switch firmware upgrade.

 You can learn about Exadata in detail from the book Expert Oracle Exadata.

 ==========================================================

Please check our other blogs for Exadata
