Wednesday, November 7, 2012

Zimbra Incremental Migration: an experience

A few days ago, I migrated our company's Zimbra mail system to a new server.  Because there were so many messages to move, the migration was hard work and caused a lot of trouble.

A brief description of the task:
  • The old server: CentOS 5.5 x86_64, ZCS 7.2.0
  • The new server: CentOS 6.3 x86_64, ZCS 8.0.0
  • Mail accounts: 1500, messages: 4 million, storage: 600 GB
  • Bandwidth between two servers: 100 Mbps
I followed the method described in ZxBackup: Incremental migration with ZeXtras Backup.  The process comprised the following steps:
  1. Backup of all messages on the old server: about 4 million items, backup time: 3 days.
  2. Synchronization of backup data to the new server: data size: 320GB, files: 6.4 million, transfer time: 1 day.
  3. Restore of old messages on the new server: restore time: 5 days.
  4. Incremental backup and restore of messages received since the last backup
  5. Switch of the mail flow to the new server
  6. Incremental backup and restore of messages received since the last backup
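
As a rough sanity check on step 2, the figures above can be turned into a back-of-the-envelope estimate. The sketch below assumes decimal units (1 GB = 10^9 bytes) and a link that is fully used with no protocol overhead, so it gives a lower bound rather than a prediction; the function names are made up for illustration.

```shell
#!/bin/sh
# Ideal transfer time in seconds for size_gb (decimal GB) over a link of
# link_mbps megabits per second, ignoring all protocol overhead.
ideal_transfer_seconds() {
    size_gb=$1; link_mbps=$2
    echo $(( size_gb * 8 * 1000 / link_mbps ))
}

# Average file size in bytes when size_gb (decimal GB) is spread over
# file_count files.
average_file_bytes() {
    size_gb=$1; file_count=$2
    echo $(( size_gb * 1000 * 1000 * 1000 / file_count ))
}

secs=$(ideal_transfer_seconds 320 100)
echo "ideal transfer time: $(( secs / 3600 )) h $(( secs % 3600 / 60 )) min"
echo "average file size: $(average_file_bytes 320 6400000) bytes"
```

The ideal figure is about 7 hours, against the one day actually observed. The average file is only about 50 KB, so per-file overhead on 6.4 million small files may account for part of the gap, though that was not measured.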
I finished the above steps in 10 days and switched the mail flow to the new server on Sunday. Everything seemed OK.  But on Monday, when hundreds of colleagues came to the office, serious trouble followed.

Firstly, our colleagues found that their IMAP mail clients (Outlook, Foxmail, Thunderbird, ...) always tried to download every message in the Inbox folder of their accounts on the server, regardless of whether the message had been read and downloaded before.  The server's outgoing network traffic easily reached 100 Mbps.

Secondly, some colleagues' mail clients could not send or receive messages at all.

The server also frequently failed to respond to users.

I examined the server's log files and found the following problems:

1. SSL errors from IMAP and POP3 proxies (/opt/zimbra/log/nginx.log):
SSL: error:1408F10B:SSL routines:SSL3_GET_RECORD:wrong version number
SSL: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac
SSL: error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
These errors prevented users from logging in and fetching messages.  I don't know where the errors came from.  There had been no errors when the commercial certificate was deployed.  So I disabled the proxies completely to avoid these errors.
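
To see which of these errors dominates, the proxy log can be summarized by OpenSSL error code. The following is only a sketch: it assumes every relevant line contains the text `SSL: error:<hex code>:` as in the excerpts above, and the function name is hypothetical.

```shell
#!/bin/sh
# Read log lines on stdin and print each OpenSSL error code with its count,
# most frequent first. Lines without "SSL: error:<hex>:" are ignored.
count_ssl_errors() {
    sed -n 's/.*SSL: error:\([0-9A-Fa-f]*\):.*/\1/p' | sort | uniq -c | sort -rn
}

# Log path taken from the post; skipped silently if the file is not readable.
log=/opt/zimbra/log/nginx.log
if [ -r "$log" ]; then
    count_ssl_errors < "$log"
fi
```

A summary like this does not explain the cause, but it shows whether one error type accounts for most of the failures.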

2. Pop3SSLServer and ImapSSLServer out of memory (/opt/zimbra/log/mailbox.log):
2012-10-30 11:15:45,161 ERROR [ImapSSLServer-96] [name=user@example.com;mid=682;ip=1.2.3.4;] imap - java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
Since all users were downloading large amounts of old messages at the same time, the default Java heap size (about 3 GB) set by the ZCS installer was not enough.  I simply changed it to a larger value:
(run as zimbra user)
zmlocalconfig -e mailboxd_java_heap_size=6144
zmmailboxdctl restart
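
Before raising the heap, it may be worth checking that the new value still leaves memory for MySQL and the other Zimbra services. The sketch below is not part of the original procedure: the 50% ceiling is an arbitrary assumption chosen for illustration, the function names are hypothetical, and the RAM lookup relies on Linux's /proc/meminfo.

```shell
#!/bin/sh
# Total physical memory in MB, read from /proc/meminfo (Linux only).
total_ram_mb() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo
}

# Succeeds if heap_mb is at most max_percent of ram_mb.
heap_fits() {
    heap_mb=$1; ram_mb=$2; max_percent=$3
    [ $(( heap_mb * 100 )) -le $(( ram_mb * max_percent )) ]
}

proposed=6144   # value used in the post, in MB
# Only runs on a host where the Zimbra CLI is on the PATH (as the zimbra user).
if command -v zmlocalconfig >/dev/null 2>&1; then
    echo "current setting: $(zmlocalconfig mailboxd_java_heap_size)"
    if heap_fits "$proposed" "$(total_ram_mb)" 50; then
        echo "${proposed} MB is within 50% of physical RAM"
    else
        echo "${proposed} MB exceeds 50% of physical RAM; reconsider" >&2
    fi
fi
```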
3. ZeXtras Backup added a tag `0' to all restored messages, but this tag was not properly defined in the database.  With some mail clients (Foxmail), the tag prevented users from downloading message bodies over IMAP.  I looked up the tag_id of `0' (257 on our server) and removed the tag from all messages in the databases:
for n in `seq 1 100`; do
mysql << _EOF
update mboxgroup$n.mail_item set tag_names=null where type=5;
delete from mboxgroup$n.tagged_item where tag_id=257;
delete from mboxgroup$n.tag where id=257;
_EOF
done
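
Since the loop above deletes rows, a read-only count beforehand can show how many rows would be affected. This sketch only prints SQL; it reuses the table and column names from the statements above, assumes the same 100 mboxgroup databases, and uses tag id 257, which was the value on this server and may differ elsewhere. The function names are hypothetical.

```shell
#!/bin/sh
# Print a read-only query that counts the rows for one tag in one mailbox group.
count_sql() {
    group=$1; tag_id=$2
    printf 'select count(*) from mboxgroup%s.tagged_item where tag_id=%s;\n' \
        "$group" "$tag_id"
}

# Print the count query for groups 1..100, the same range as the delete loop.
all_count_sql() {
    n=1
    while [ "$n" -le 100 ]; do
        count_sql "$n" "$1"
        n=$(( n + 1 ))
    done
}

# Show the first query as a sample; nothing is sent to the database here.
count_sql 1 257
```

After reviewing the generated statements, they could be piped to the database client, for example `all_count_sql 257 | mysql`.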
4. To avoid downloading all the old messages in the Inbox, I advised all users to log in to Zimbra via the web interface, create a new folder named `Oldmail', move all messages from Inbox to `Oldmail', and then move back to Inbox only the recently received messages they wanted to download.

By late Tuesday, the server had returned to normal.

3 comments:

  1. As I stated in the original post, I don’t know where the errors came from. So I completely disabled the proxies, i.e., I do not use the nginx proxy and let the servers listen on their default ports.

  2. In the future, you might consider using wiki.zimbra.com/wiki/Ajcody-Notes-Server-Move to move between servers. :) Your downtime would have probably been a couple of hours instead of more than a week. :)

  3. Hello,

    I have this exact problem after upgrading from Zimbra 7 (single server) to Zimbra 8.0.4 Multi Server.

    Does anyone have a clue how to work around it?

    Regards,
    Vitor
