Cloud Integrated Services (CIS) on OES
Published Originally in 2021, Significantly updated August 24, 2022 (Work in Progress)
Best Practice Guide
CIS is a service that became available with the release of OES 2018 SP1 in 2019. I started working with it shortly after it was released, due to a large-scale customer implementation. Since then I have worked with the product extensively as it has been developed and improved considerably. (I am currently working with OES 2018 SP3 plus an April 2022 CIS update.)
The intent of this guide is to help you with the implementation and hopefully get you through some of the bigger hurdles. It does not cover everything, and it is a work in progress.
To be clear about what the product is supposed to accomplish and how the systems interact, here is a short overview:
- CIS requires that you have an S3 Compatible storage target "somewhere". You can use an Amazon S3 target or another provider that offers S3 Compatible services.
- Your original OES servers running NSS storage volumes remain in place. You will not be moving NSS volumes or files to the CIS servers. It's not necessary and I wouldn't recommend it.
- Your CIS architecture requires multiple servers just for the CIS product to function. These are dedicated servers; CIS does not (and should not) run on any other OES server, period. Furthermore, you should not attempt to install other OES services on your dedicated CIS systems.
- The CIS product functions very much like the "DST" technology that's been an option in OES for years. The difference is that DST only works locally, while CIS takes it a step further and offloads the data (by policy) to the cloud. That said, DST was a simple configuration in comparison to the complex architecture and system requirements of CIS.
Tip #1: Approach with Caution
CIS is a difficult product to work with and can be an absolute beast. While it has greatly improved, the original release was not ready for production due to the massive number of problems and bugs, and the lack of resources from both a documentation and support standpoint. In general, my experience has been as follows:
- In its current state it is literally impossible to install the product successfully without getting the developers involved.
- Development resources are extremely limited and mostly outsourced to a team located in India. This creates language barriers, time zone challenges, and extremely poor-quality phone lines when you're trying to do a phone call with them.
- The USA-based OES support team does not know the product or how to work with it. They rely almost completely on developer resources in India to help with any support tickets.
- The documentation sucks and does not provide the level of detail necessary to properly implement the product or understand how the different components work together.
- Micro Focus has not published any useful TIDs related to overcoming the many technical challenges you will face.
- Regardless of what problem you have, you will likely have to have a dial-in with developers that will take 2-3 hours of your time. They will likely delete and recreate all of the certificates used by the solution. When that doesn't work, they will tell you they have to get back to you with a solution. You will then wait 2-3 weeks.
- If your main background is with the Novell line of products like OES, GroupWise, or whatever, you'll find out that the entire solution includes a bunch of applications and components you've never seen or used before (Apache Kafka, Apache ZooKeeper, Elasticsearch, and a database), and aren't used by any other OES or Micro Focus service. You'll also find that there are no tidbits of useful information out there about how to manage those components from an OES/CIS perspective.
- It's possible, and likely, that every problem you experience will result in a "defect" being created which will then result in a patch being developed specifically for your issue. Expect this process to take 3-4 weeks.
- In many cases, your project won't be able to progress until each defect is resolved. You won't know what defects lie ahead until you get the current defect resolved. This is because many of the steps require that you complete them in a certain order before moving on to the next step.
- Realistically you could plan on taking months or longer to implement CIS for a production environment due to the stated items above.
Tip #2: Don't install a Single Server "pilot project"
NOTE: This is critical and could save you months of time.
CIS can be installed as a "Single Server" environment (For testing or very small environments) or a much more robust "Multiple Server" implementation for production environments with heavy data demands and storage requirements.
It's common for a project of this magnitude to go through a test phase or pilot. While it seems logical to build a single server for this purpose, you will find that this is an absolute waste of time. You're better off just building a production environment and testing in the production environment on a smaller scale, then shifting to a full production data set. Here are my reasons why:
- The single server environment would appear on the surface to be simple, but it's not. You will experience a great number of challenges, and nobody needs this kind of stress. There is literally nothing learned by setting up a Single Server environment that will benefit you in the production environment. In fact, it will probably cause more confusion than anything.
- If you set up a single server environment for testing, and then you want to go into production with the recommended architecture (minimum of 6 servers), you will more than double the work and possibly the time for implementation. You will first beat your head on the wall implementing the Single Server system, and then you will be beating your head on the wall when you build the production environment.
- The problems you have (and hopefully overcome) in a Single Server environment are not the same problems you will face in a Multiple Server environment. There's no comparison.
- You cannot take your Single Server test system and morph it into a Multiple Server production system. It's just not practical or possible.
- You won't be able to have a Single Server environment and a Production Environment in the same tree working side by side. It's plausible for you to think that you can keep your pilot project running while building a simultaneous production environment, but it won't work. This is due to the way NSS volumes are used by CIS, and also the way the CIS object configuration is stored in eDirectory.
- Somewhere along the way, I was told that a production environment should not be run on a Single server anyway. Therefore if you want support, you need to use a multi-server environment.
Tip #3: Design a Production Environment right from the start
When building your production environment, here are some things you'll want to do that will help. Note that this won't make it perfect, but will go a long way.
- Build each server from scratch and use the latest available OES 2018 install media. The initial releases of CIS were extremely buggy. As of today, you should be installing CIS using OES 2018 SP3 Media.
- Once the initial server is built, you should register the server and apply all available updates before you insert the OES server into the tree or install the CIS components. This ensures that you are running the newest code before attempting any of the CIS configurations.
- Knowing that you need 6 or more servers, name your servers in a way that makes sense and according to their function.
- CIS relies heavily on DNS and the FQDN of the server. Ensure that the servers are registered with your DNS services and are able to resolve by name.
- CIS relies completely on SSL Server Certificates being generated correctly. In my experience, the default certificates that are installed when you install OES are not adequate and must later be replaced with manually created certs. I go into this in more detail further on in this document. (Hopefully at some point the OES installers will create the needed certificates correctly, but as of writing, I have not had success with this).
- CIS relies on the OES Common Proxy in order to register systems and discover services. More on this later, but it's important that all of your OES servers running NSS volumes and all of your OES servers in the CIS implementation have working OES Common Proxies.
- Do not rename servers after they have had CIS installed on them. There are too many things that go wrong if you rename a server, especially one with CIS on it. Just don't do it.
- Read the entire documentation forward and backward several times before installing. You will find that there are system requirements strewn throughout the documentation and it is not all located in the "System requirements" section. If you aren't aware of this, you could build a server with an unsupported file system and not realize it until it's too late.
- Document every detail of every component on every single server. IP Addresses, names, port numbers, services, paths, etc. Don't assume that you'll be able to figure it out later. You'll be overwhelmed and won't know where to start when you need to dig.
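Since CIS leans so heavily on DNS and the FQDN, it can help to check name resolution on every server before installing anything. The following is only a minimal sketch: normalize_name and check_dns are hypothetical helper names of my own, and getent consults whatever your nsswitch configuration points at (hosts file as well as DNS), so treat a pass as a sanity check rather than proof that DNS itself is correct.

```shell
#!/bin/bash
# Sketch: confirm that forward and reverse name resolution agree for a host.
# normalize_name and check_dns are hypothetical helpers, not part of OES or CIS.

# Lower-case a DNS name and strip a trailing dot so names compare cleanly.
normalize_name() {
    printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/\.$//'
}

# check_dns FQDN: resolve name -> IP, then IP -> name, and compare the two.
check_dns() {
    local fqdn ip reverse
    fqdn=$(normalize_name "$1")
    # getent prints "IP name [aliases]"; take the first address returned.
    ip=$(getent hosts "$fqdn" | awk 'NR==1 {print $1}')
    if [ -z "$ip" ]; then
        echo "FAIL: $fqdn does not resolve"
        return 1
    fi
    # Reverse lookup: the second field is the canonical name for the address.
    reverse=$(normalize_name "$(getent hosts "$ip" | awk 'NR==1 {print $2}')")
    if [ "$reverse" = "$fqdn" ]; then
        echo "OK: $fqdn <-> $ip"
    else
        echo "FAIL: $ip resolves back to '$reverse', expected '$fqdn'"
        return 1
    fi
}
```

Run it as, for example, check_dns "$(hostname -f)" on each server, and again from another server against each of the CIS server names you documented.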
General Server Inventory / What You Need
During the process of installing a production environment you will need a minimum of the following OES servers:
- Three (3) Infrastructure Servers
- One (1) Data Scale Server
- One (1) Database Server
- One (1) Main CIS server (Where the management console and dashboard run)
Your Existing OES Infrastructure with NSS Volumes
- You should have existing OES servers with Novell Storage Services and NSS Volumes.
- I would utilize existing servers, but you may want to implement new systems as well for various reasons.
- Ensure your OES servers are OES 2018 with the latest patches. You will not have the full functionality of CIS if you're not fully up to date.
There are suggestions in the documentation to use OES Clustering for even more confusion. Personally I have not been a fan of Novell Clustering for some time. I won't go into details here, just know that I do not cover OES clustering in this document.
Helpful Troubleshooting Tips
** BELOW THIS POINT IS BASICALLY CHICKEN SCRATCHES. Notes, thoughts, processes, files, and other things that I have had to work with during troubleshooting. I will organize and condense this as I am able to. **
Here are some tips that can help you when you're troubleshooting CIS.
- The command "cishealth -verbose" will tell you what you believe is true, that everything is broken.
- The command "docker node ls" will tell you what nodes are broken in your "Infrastructure Services" clusters.
- The command "docker ps" will tell you what docker processes are running. It likely won't mean anything to you, but it's helpful to know that on the Infrastructure Services server, you should have several processes running.
- The command "rcdocker restart" will restart the docker services on your Infrastructure Service systems.
- When you're troubleshooting CIS, you are not only troubleshooting your CIS systems, but you are also troubleshooting your OES / NSS file servers.
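When you end up on a call with the developers, it is handy to have the output of the commands above captured in one file per server. This is a rough sketch, not an official tool: run_if_present and collect_snapshot are hypothetical names, and the script simply skips any command that is not installed on the host where you run it (docker, for example, is only expected on the Infrastructure Services systems).

```shell
#!/bin/bash
# Sketch: capture the troubleshooting commands listed above in one pass.
# run_if_present and collect_snapshot are hypothetical helper names.

# run_if_present CMD [ARGS...]: print a header, then run the command
# only if it exists on this host; otherwise note that it was skipped.
run_if_present() {
    echo "=== $* ==="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "skipped: $1 not found on this host"
    fi
}

# collect_snapshot: run each read-only command from the list above.
# "rcdocker restart" is deliberately left out because it changes state.
collect_snapshot() {
    run_if_present cishealth -verbose
    run_if_present docker node ls
    run_if_present docker ps
}

# Example (assumed output location):
#   collect_snapshot > "/tmp/cis-snapshot-$(hostname)-$(date +%Y%m%d-%H%M%S).txt"
```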
Certificates can be Difficult
I'm optimistic that as the product matures, the problems with certificates will be minimized. But certificates are one of the areas where I experience the most problems. The common theme seems to be that certificates as installed by OES are not valid (or not good enough) for CIS. But it could vary depending on how the server got installed. Here are some things I can tell you about certificates:
- When OES is installed, the certificates are exported to the file system and used by some of the standard OES services. However, CIS does not seem to use them in those default locations.
- When CIS is installed, the certificates are copied to CIS specific path locations rather than just referencing the original certificates. This means that if you ever recreate or update the OES certificates, they have to be manually copied to the CIS location. (Based on recent developments and work done in Aug 2022, it's possible that this has changed. However I have not confirmed.)
- /etc/opt/novell/cis/certs seems to be one of the path names used by CIS.
- However, it's possible that the path is at /media/nss/DATA/etc/opt/novell/cis/certs (Some NSS volume that you've created and configured for use by CIS, probably the main CIS service system).
- nslookup xx.xx.xx.xx should return the FQDN of your server.
- The certificate apparently needs the server name and IP address in the Subject Alternative Name.
- Repair certificates
- /etc/ssl/servercerts/servercert.pem needs to have a Subject Alternative Name that matches the server's FQDN, and apparently it doesn't by default.
- "ndsconfig upgrade -j" recreates the cert and forces it out to the services that use it.
- "openssl x509 -in /etc/ssl/servercerts/servercert.pem -noout -text" should show the server name in the Subject Alternative Name in FQDN format.
- The server name probably should be in FQDN format in the Subject Alternative Name. This is set manually when customizing the cert.
- Don't forget to run "namconfig -k" after recreating server certs.
- Copy the certificates from /etc/ssl/servercerts to /etc/opt/novell/cis/certs.
- The Kafka keystore needs to be updated on all Docker swarm nodes.
- on the main cis server after recreating certs:
- systemctl restart oes-cis-configuration.service
- systemctl restart oes-cis-server.target
- Run the "kafka_keystore_update.sh" script on the Docker swarm servers (Infrastructure Services), then restart Docker with "rcdocker restart".
There was some problem with the database; it needed to be updated (on the MAIN server) after the certs were updated.
- resetdbcred.zip file from the developers: a script for something related to the database.
- resetdbcred -zk_url oescisis-int1:2282 -db_pass DJDSFLSDKF(Password root)
- /etc/opt/novell/cis/.creds .encCISCreds and .encCISKey were updated to today's date.
- systemctl restart oes-cis-configuration.service oes-cis-fluentbit.service
- systemctl restart oes-cis-server.target
NOW I'm able to access the main CIS Management page.
Need to fix S3 Target Cert. Ensure PEM format. Need to convert .crt to .pem
- systemctl restart oes-cis-data.service
Connect to S3 Target
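For the S3 target certificate, a ".crt" file may already be PEM-encoded text or may be binary DER, so it is worth checking before converting. The sketch below assumes exactly those two cases; is_pem and to_pem are hypothetical helper names, and the file names in the usage line are placeholders for whatever your S3 provider gave you.

```shell
#!/bin/bash
# Sketch: make sure a certificate file is in PEM format.
# is_pem and to_pem are hypothetical helpers; file names are placeholders.

# is_pem FILE: succeed if the file contains a PEM certificate header.
is_pem() {
    grep -q -- '-----BEGIN CERTIFICATE-----' "$1"
}

# to_pem IN OUT: copy IN to OUT if it is already PEM,
# otherwise assume it is DER-encoded and convert it with openssl.
to_pem() {
    local in=$1 out=$2
    if is_pem "$in"; then
        cp "$in" "$out"
    else
        openssl x509 -inform DER -in "$in" -outform PEM -out "$out"
    fi
}

# Example (placeholder names):
#   to_pem s3target.crt s3target.pem
```

If openssl rejects the file in the DER branch, the certificate is probably in some other container format and needs a different conversion.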
CIS Admin Server (Dashboard)
Call it what you want: the Main CIS server, the Dashboard, the Administration console. It's just one of the complex pieces of the CIS puzzle, but if it's not working correctly you won't be able to do anything. Here are some things specific to this server that are helpful when CIS is not working.
The "cishealth -verbose" command mentioned in the troubleshooting tips gives you an overview of what's broken on the Admin/Dashboard server.
The goal in a perfect world is that everything is working and healthy. Most likely it isn't and it reports "Not Healthy" at the end. The output, when the system is broken, can be overwhelming, but it does provide some helpful information if you are patient and can look through it objectively. The command goes through a series of checks and reports on each item such as:
- CIS Configuration
- CIS Infrastructure Services
- CIS Gateway Health
- CIS Core Services
Each section of the report will show what is failing. There is then some detail that explains how you can troubleshoot or resolve the situation. The thing to be aware of is that the problem could be on any of your CIS servers, not just the Admin server. For example, ZooKeeper runs on the Infrastructure Services systems (Generally 3 or more of these). If you're having issues connecting, and the health fails due to that, then the ZooKeeper service needs to be investigated and any issues there resolved.
OES File Server / NSS Volumes
The whole point of CIS is to take files located on your OES file servers and offload them to the CIS Cloud provider. The CIS architecture is ONLY the components that manage the process of moving NSS files to the cloud. Your actual file systems remain on their existing OES NSS volumes; you do not move them to the CIS servers. This section is relevant to each OES file server and what is required for them to work correctly with CIS.
Tip #1: Update/Patch Servers.
Numerous CIS components and agents are installed on all OES servers where NSS volumes exist. Ensure that your servers are fully patched for best results. As of writing, the current patch level is OES 2018 SP3 with a significant post-SP3 patch for CIS released in April 2022. When you patch the servers, patches to the CIS agents and components are also applied, so having a fully patched OES server will ensure that these critical CIS patches are applied.
Agents on the OES servers that are Relevant
- oes-cis-recall-agent.service (Agent that is responsible for recalling files from the cloud back to the OES server)
- oes-cis-scanner.service (Agent that scans the NSS volumes and provides that info to the CIS server)
- oes-core-agent.service (Connection between core CIS service and OES NSS server)
- oes-dashboard-agent.service (Registers OES Server with the CIS Dashboard)
Restarting the CIS Agents on OES Servers
To restart the agents, you use the standard systemctl command:
- systemctl start/stop/restart service (Example: systemctl start oes-cis-agent.service)
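Rather than checking the agents one at a time, a small loop can report the state of all of them. This is a minimal sketch: agent_report is a hypothetical helper name, and the unit list is simply the agent names that appear in this guide, so adjust it if your server shows different units.

```shell
#!/bin/bash
# Sketch: report the systemd state of the CIS-related agents on an OES server.
# agent_report is a hypothetical helper; the unit names are taken from this guide.

CIS_AGENTS="oes-cis-agent.service oes-cis-recall-agent.service oes-cis-scanner.service oes-core-agent.service oes-dashboard-agent.service"

# agent_report: print one "unit: state" line per agent.
agent_report() {
    local unit state
    for unit in $CIS_AGENTS; do
        # is-active prints a state word such as active, inactive, or failed.
        # If nothing is printed (for example, systemctl is missing), show "unknown".
        state=$(systemctl is-active "$unit" 2>/dev/null)
        echo "$unit: ${state:-unknown}"
    done
}

# Example:
#   agent_report
```

Anything that does not report "active" is a candidate for a "systemctl restart" and a look at the matching log file.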
On Each OES Host
- /etc/opt/novell/eDirectory/conf/nds.conf is where the CIS agent name for the OES server comes from, and it is used to register the OES server with the CIS dashboard.
- Pool resources get name from vfs calls (Unsure why I noted this, it seemed relevant)
- /etc/ssl/servercerts/servercert.pem is the server certificate that corresponds to the certificate you created in iManager.
The OES server won't be able to register with CIS if the certificate is malformed. You should plan on just recreating the certificate on each OES server where NSS volumes exist that you want to be offloaded to the cloud. When creating a certificate, note the following:
- Even though you already have SSL Certificates on your OES servers, you will probably need to delete/recreate them.
- You'll use iManager to recreate the server certificate.
- You'll use the standard "SSL CertificateDNS" certificate name that is used by many different services on OES.
- Creating the certificate, you will step through the process and NOT use the defaults. You must customize it.
- Select key type as SSL or TLS and Extended key usage as "Server authentication" and "User authentication". These are required attributes for CIS to function, and are not present on default certificates.
- In the "Subject Alternative Name" attributes, you MUST add the FQDN (Full DNS Name) of the server. This is a requirement for CIS to function.
- In the "Subject Alternative Name" attributes, I have ALSO sometimes had to add the servername (Just the hostname) for the certificate to be accepted. This is in addition to the FQDN. In other words, create a Subject Alternative Name for both formats.
- In order to not have to go through this process again for a while, choose the "Maximum Time" for the expiration of the certificate. Do not accept the 2 year default unless you want to redo these before you're even successful with your CIS implementation (lol).
- After you recreate the certificates, it's probably easiest to restart your server due to the number of services that utilize the certificate you just created.
Testing your Certificates
After creating the certificate and restarting the server, you should confirm from the Linux OES command line that your certificate contains the Subject Alternative Name. You can do this via this command:
- openssl x509 -in /etc/ssl/servercerts/servercert.pem -noout -text
This will display the detail of the certificate. It's imperative that somewhere in the output, you see the Subject Alternative Name. In the Subject Alternative Name, you must see the FQDN of the server.
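Reading the full certificate dump by eye gets old quickly when you have many servers. As a rough sketch, the check can be automated by piping the openssl output into a small filter; san_has_dns is a hypothetical helper name, and it assumes the usual openssl text layout where the SAN values sit on the line directly after the "X509v3 Subject Alternative Name" header.

```shell
#!/bin/bash
# Sketch: check that a certificate's Subject Alternative Name lists a DNS name.
# san_has_dns is a hypothetical helper, not an OES or CIS tool.

# san_has_dns NAME: read "openssl x509 -noout -text" output on stdin and
# succeed only if the SAN section contains an exact DNS:NAME entry.
san_has_dns() {
    local want
    want=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    # Take the header line plus the line after it, lower-case everything,
    # split the comma-separated entries, trim whitespace, then match exactly.
    grep -A1 'X509v3 Subject Alternative Name' \
        | tr '[:upper:]' '[:lower:]' \
        | tr ',' '\n' \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | grep -qxF "dns:$want"
}

# Example:
#   openssl x509 -in /etc/ssl/servercerts/servercert.pem -noout -text \
#       | san_has_dns "$(hostname -f)" && echo "SAN OK" || echo "SAN missing FQDN"
```

The same filter can be run a second time with just the short hostname if you also added that form to the Subject Alternative Name.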
If you do not want to restart the entire server, at least restart the following:
- systemctl restart ndsd.service
- systemctl restart oes-core-agent.service
- systemctl restart oes-dashboard-agent.service
OES Common Proxy and Server Registration to CIS Dashboard
OES servers use the OES Common Proxy to query LDAP for CIS location information. If the OES Common Proxy is broken, this LDAP query will fail. Therefore, it is imperative that the OES Common Proxy is working on any OES server with NSS volumes that you want to be managed by CIS. Failure to have a working Common Proxy will prevent the OES server from registering with the CIS Dashboard.
Scenario: One of your OES servers is not showing up in the CIS dashboard. The dashboardagent.log file shows a problem such as "unable to fetch CIS configuration". If you dig further you'll find that LDAP requests are failing. I have found two main reasons for a broken Common Proxy:
- Missing Common Proxy eDirectory Object
- Problem with Common Proxy Config on the OES server
Both of these problems are fairly simple to resolve. This is how you resolve each of them:
- Missing Common Proxy Object in eDirectory
- You can identify that this is an issue by looking in iManager for the common proxy object for your server. It should be named "OESCommonProxy_SERVERNAME". For example, if your server is named FS1, the Common Proxy is called "OESCommonProxy_FS1". If you do not see this object in the same container as your server, it is likely missing and needs to be repaired.
- The "Common Proxy Repair Script" (Next Item) will also report that the Common Proxy Edir Object is missing.
- Run this command to create the Common Proxy eDirectory object:
- Example: /opt/novell/proxymgmt/bin/move_to_common_proxy.sh -d cn=admin,o=treetop -w password -i 192.168.0.1 -p 636 -s all
Usage: move_to_common_proxy.sh [options]
-h Prints this summary
-d LDAP Admin FDN
-w LDAP Admin Password
-i LDAP Server IP address
-p LDAP Port
-s Service Name ('all' should be used to move all the services)
- Problem with Common Proxy Config on the OES server
- Download and run the Common Proxy Repair Script for OES 2018 (common-proxy-fix-1.5.sh). Download from this MicroFocus community page: https://community.microfocus.com/img/oes/w/oes_tips/21913/common-proxy-repair-script-for-oes2018-oes2015-and-oes11. Please also review the entire document to understand the purpose and how it works.
- Unzip the file, and then run the script.
- Generally speaking, it will likely be missing information or report a problem. It should identify the Default Common Proxy User correctly (option 1). Choose this and it will populate that same user to other files and locations on the OES server. That is generally all that is required to resolve this issue. You can run the script multiple times until it shows correctly from the start.
Usage: common-proxy-fix-1.0.sh -u ADMIN_DN [options]
-u ADMIN_DN Admin username in LDAP syntax (required)
-w ADMIN_PASS Admin password
-h LDAPS_IP LDAPS IP, default is 127.0.0.1
-p LDAPS_PORT LDAPS port, default is 636
-f SERVICE_LIST Comma separated list of services to force config for CASA
Example: common-proxy-fix-1.0.sh -u cn=admin,o=org
Example: common-proxy-fix-1.0.sh -u cn=admin,o=org -w P@ssw0rd
Example: common-proxy-fix-1.0.sh -u cn=admin,o=org -h 192.168.2.5 -p 1636
Example: common-proxy-fix-1.0.sh -u cn=admin,o=org -f cifs,dns,dhcp
Explanation from the Developers:
- There is a cisinfo LDAP attribute in the tree.
- Any agent who has to register with CIS will walk up the tree and read that attribute.
- The ldap connection is made using the OES Common Proxy.
- Once the server name is obtained, an http call is made to register with the CIS system.
Restart the Agents after resolving Common Proxy Problems
Restarting these agents forces them to re-read the configuration and should result in the server registering properly.
- systemctl restart oes-dashboard-agent.service
- systemctl restart oes-core-agent.service
On the OES NSS file system servers, these are the log files used by the agents. The naming is somewhat self-explanatory.
- /var/opt/novell/log/cisagent/agent.log (related to the migration of files from OES to the cloud)
- /var/opt/novell/log/cisagent/cisscanner.log (related to the scanner jobs that run against the NSS volumes)
- /var/opt/novell/log/cisagent/recallagent.log (related to jobs where files are recalled from the cloud back to OES)
- /var/opt/novell/log/coreagent/coreagent.log (related to communication between OES and the CIS core services)
- /var/opt/novell/log/dashboardagent/dashboardagent.log (related to communication between OES and the CIS Dashboard)
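To avoid opening each of these logs by hand, a quick scan can pull the most recent suspicious lines out of all of them. This is only a sketch: scan_cis_logs is a hypothetical helper name, and apart from the "unable to fetch CIS configuration" message quoted earlier, the default search words (error, fail) are my assumption about what the logs contain, not a documented log format.

```shell
#!/bin/bash
# Sketch: show recent problem lines from the CIS agent logs on an OES server.
# scan_cis_logs is a hypothetical helper; the default pattern is an assumption.

# scan_cis_logs [LOG_ROOT] [PATTERN]
#   LOG_ROOT defaults to /var/opt/novell/log
#   PATTERN  is an extended regex, matched case-insensitively
scan_cis_logs() {
    local root="${1:-/var/opt/novell/log}"
    local pattern="${2:-unable to fetch CIS configuration|error|fail}"
    local log
    for log in \
        "$root/cisagent/agent.log" \
        "$root/cisagent/cisscanner.log" \
        "$root/cisagent/recallagent.log" \
        "$root/coreagent/coreagent.log" \
        "$root/dashboardagent/dashboardagent.log"
    do
        # Skip logs that do not exist or cannot be read on this server.
        [ -r "$log" ] || { echo "$log: not readable, skipped"; continue; }
        echo "== $log =="
        # Show at most the last 20 matching lines per log.
        grep -iE "$pattern" "$log" | tail -n 20
    done
}

# Example:
#   scan_cis_logs
#   scan_cis_logs /var/opt/novell/log 'certificate|x509'
```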
This command restarts the main agents on any given server and reloads the certificates that you had to recreate:
systemctl restart oes-core-agent.service oes-dashboard-agent.service
On the OES file server:
systemctl status oes-core-agent.service (connection between core service and server)
systemctl status oes-dashboard-agent.service (the connection between server and dashboard)
systemctl restart oes-cis-agent.service oes-cis-recall-agent.service oes-cis-scanner.service
While a job is running, on the main CIS server
journalctl -f -u oes-cis-reporting-aggregator.service (This will show you the live logs of what is happening from that perspective).
journalctl -f -u oes-cis-data.service (This shows whether files are actually being uploaded to the Cloud Service)
On the main CIS server
journalctl -u oes-dashboard.service