DevOps in 2013 covers the current state of IT operations automation and the issues in the SDLC that must be addressed in order to achieve continuous delivery.
Migrating to the Cloud: Making the Move
In Part II we discussed the strategy, architecture and roadmap for implementation. Here we elaborate on the migration process itself. In this step, "Making the Move," the architectural specifications are used to drive the migration. The elements of the architecture should be moved in the order set by the prioritized roadmap. Governance processes guide the migration activities, which are:
- Establish the Environments
- Develop the Migration Components
- Migrate to Test
- Move to Production
Establish the Environments
Create the provisioning templates for the instances. Determine whether images will be "baked" (pre-built with the full stack) or "fried" (configured at boot) using automation tools such as Chef and Puppet, and cloud management tools like ServiceMesh. Instantiate cloud resources and provision the development, test and production environments based on the specifications in the solution architecture.
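To illustrate the distinction, here is a minimal, hypothetical sketch using the boto3 SDK: one instance is launched from a pre-baked AMI, the other from a stock image that is "fried" at boot via a user-data bootstrap. The AMI IDs, key pair and security group names are placeholders, not recommendations.

```python
# Hypothetical sketch (boto3): "baked" vs. "fried" instance provisioning.
# AMI IDs, key pair and security group names are placeholders.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

# "Baked": the image already contains the full application stack; boot and go.
baked = ec2.create_instances(
    ImageId="ami-BAKED-PLACEHOLDER",
    InstanceType="m1.large",
    MinCount=1, MaxCount=1,
    KeyName="migration-key",
    SecurityGroups=["app-tier"],
)

# "Fried": start from a stock image and configure at boot, e.g. by
# bootstrapping a Chef or Puppet client from a user-data script.
bootstrap = """#!/bin/bash
# Placeholder: install and run the configuration management client here,
# e.g. chef-client with the appropriate run list, or puppet agent.
"""
fried = ec2.create_instances(
    ImageId="ami-STOCK-PLACEHOLDER",
    InstanceType="m1.large",
    MinCount=1, MaxCount=1,
    KeyName="migration-key",
    SecurityGroups=["app-tier"],
    UserData=bootstrap,
)
```

Baking favors fast, repeatable boots at the cost of image maintenance; frying keeps images generic but makes boot time and first-boot failures part of your provisioning story.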
Develop the Migration Components
As defined in the migration approach, rework the components specified in the architecture. All the heavy lifting occurs in the development environment. Remember, for each component, think through the SLAs and health aspects. Identify the method for determining whether a component is healthy, and if it is not, document (or codify) how you will go about reviving it. For each component migrated, there should be clear directions on how to maintain the item and ensure security, availability and responsiveness.
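As a simple illustration of codifying component health, a check along the following lines could back the runbook; the URL, retry policy and revival command are hypothetical placeholders.

```python
# Hypothetical health check for a migrated component: probe an HTTP
# endpoint, and if it fails repeatedly, run the documented revival step.
import subprocess
import time
import urllib.request

HEALTH_URL = "http://app.internal.example.com/health"   # placeholder
RETRIES = 3
REVIVE_CMD = ["service", "app-server", "restart"]        # placeholder

def is_healthy():
    try:
        return urllib.request.urlopen(HEALTH_URL, timeout=5).getcode() == 200
    except Exception:
        return False

def check_and_revive():
    for _ in range(RETRIES):
        if is_healthy():
            return True
        time.sleep(10)
    # The revival procedure should mirror what the runbook documents.
    subprocess.call(REVIVE_CMD)
    return is_healthy()
```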
Migrate to Test
Once the components are built in the dev environments and tested, deploy them to test. There are multiple test plans, each serving a different purpose.
- Migration Test Plan verifies that the migration scripts execute as expected
- Application Test Plan verifies the functioning of the software features of the application in the reworked cloud environment
- Motivation Test Plan tests the critical success factors (CSFs) defined during the planning phase, measures the KPIs and validates that the true value of migrating to the cloud is being realized (a small KPI check sketch appears after this list)
- User Acceptance Test Plan is the final validation by the business users to test the business use cases before sign-off
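As an example of how part of the motivation test plan could be automated, the sketch below times a few key transactions and compares them against hypothetical KPI targets; the URLs and thresholds are placeholders.

```python
# Hypothetical KPI check: time a few key transactions in the new
# environment and compare against targets defined during planning.
import time
import urllib.request

KPI_TARGETS = {                                  # placeholder URLs/thresholds
    "http://app.example-cloud.internal/search": 0.5,   # seconds
    "http://app.example-cloud.internal/report": 2.0,
}

def measure(url):
    start = time.time()
    urllib.request.urlopen(url, timeout=30).read()
    return time.time() - start

for url, target in KPI_TARGETS.items():
    elapsed = measure(url)
    status = "PASS" if elapsed <= target else "FAIL"
    print(f"{status}  {url}: {elapsed:.2f}s (target {target}s)")
```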
Move to Production
- Provision the resources in production and execute the migration scripts to move the application to its new target environment.
- Test the application, if possible, by pointing the database at the legacy environment
- Migrate the data using cloud provider tools (e.g., AWS Import/Export) or third-party tools (e.g., Tsunami UDP)
- Enable monitoring and control
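As one small example of enabling monitoring and control, a basic CloudWatch CPU alarm could be scripted roughly as follows using boto3; the instance ID and SNS topic ARN are placeholders.

```python
# Hypothetical sketch: create a CloudWatch CPU alarm for a migrated instance.
# The instance ID and SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="app-server-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-PLACEHOLDER"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```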
The migration step is the last step in the process. All the lifting defined by the solution architecture is done in this step, and the entire solution is provisioned in its new target environment. The migration also sets up the process of on-going maintenance and operations. By using a proper migration process, companies are able to significantly reduce their on-going operating costs. A proper migration costs more up-front, but typically pays for itself within the first year of operations.
In Part I we discussed the planning required to move a legacy application to the cloud. Here we discuss the next step: establishing the foundation. In the planning phase, we began by understanding the driving factors for moving a legacy asset to the cloud. We conducted an asset analysis and gathered business and technical requirements specific to the migration. We identified potential enhancements to the asset, even though they may not have any bearing on the migration tasks. The foundation step consumes the planning requirements and produces a migration strategy and architecture. The key activities are:
- Select the Target Cloud Environment
- Define the Migration Approach
- Develop the Technical Architecture (Logical and Physical)
- Develop the Cloud Bill of Materials and Roadmap
Select the Target Cloud Environment
Identify a list of potential cloud solution providers. Understand the capabilities and limitations of each provider. Evaluate the vendors against criteria such as:
- History, reputation and client base
- Review vendor procedures for handling confidential information, addressing privacy concerns and enforcing data security, both at rest and in motion. The provider should be a member of the Cloud Security Alliance and be SAS 70 Type II certified
- Ensure there is no lock-in to proprietary programming models or services. Evaluate vendor tools and capabilities for deployment
- Validate ability to interoperate in a multi-cloud or a hybrid cloud environment
- Ensure pricing models are flexible, stable and meet the desired price/performance needs
- Ability to meet SLAs and provide a high degree of reliability and recovery from failures with minimal downtime
The process of selecting a service provider should follow the typical RFP/RFQ process.
Develop the Technical Architecture (Logical and Physical)
By now the current state, the target environment and the requirements should be well understood. Review the information gathered as a sanity check, especially if there has been a long gap since planning. Once the requirements are signed off, initiate the logical and physical architecture. The logical architecture includes:
- The target state identifying the elements such as servers, storage, load-balancers, firewalls, security groups, auto-scale groups, availability zones, disaster recovery and backup/restore
- The target security and compliance architecture (Virtual Private Cloud)
- Client connectivity to the cloud
- Specification of all architectural elements
- Mapping of software components to the servers
- Cloud services that should be leveraged by the solution
- Specifications for software components that require rework
- Cloud implementation patterns applied to the solution
- Image Management, Customization, and Portability Considerations
- Environment Management and Governance
- Cross-Environment Networking and Latency Considerations
- Operations Management and Incident Response
- A rough cost or financial view of the architecture, including estimates made in the previous activity
Formally review the logical technical architecture with the technical team, including the application, infrastructure, business, and security and compliance architects, to poke holes and test for completeness and resilience. Upon approval of the logical architecture, initiate the physical architecture.
The physical architecture elaborates the logical view further and maps all the logical elements to the physical resources in the target environment. For example, if the service provider is AWS, the servers, services and components should be mapped to AWS provided machine images, resources and services.
Define the Migration Approach
Analyze the architecture to determine the best approach for migration.
- Determine if servers should be moved “as is” (forklift), rebuilt as part of migration or rebuilt with deployment/configuration automation.
- Determine whether the software needs to be reworked for the cloud, or whether a SaaS model that fits can be leveraged instead
- Secure all required licenses and identify gaps
- Determine the approach to migrate scripts, database procedures, policies, etc. Consider automation using configuration management tools such as Chef or Puppet
- Determine the best approach to migrate the data
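For the data migration itself, one common path is to stage database dumps in object storage and restore them in the target environment. The boto3 sketch below illustrates the staging half; the dump path and bucket name are placeholders, and very large datasets may call for a bulk transfer service instead.

```python
# Hypothetical sketch: stage a database dump in S3 as one possible
# data migration path. Dump path and bucket name are placeholders.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

DUMP_FILE = "/backups/app_db_2013-01-15.dump"      # placeholder path
BUCKET = "example-migration-staging"               # placeholder bucket

# upload_file handles multipart upload automatically for large dumps.
s3.upload_file(DUMP_FILE, BUCKET, "db-dumps/app_db_2013-01-15.dump")
print("Dump staged; restore it from S3 in the target environment.")
```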
Update the financial views with the cost information.
Develop the Cloud Bill of Materials and Roadmap
Develop a dependency tree showing the order in which the architectural elements will be migrated. Translate the dependencies into a roadmap. Hold a gate review to approve the final technical architecture, cost, effort and timeline.
The foundation step is important in establishing the architecture and developing a migration strategy and roadmap. A technical gate review assesses the solution architecture and provides approval to proceed as appropriate.
The move to the cloud is on. However, some organizations are struggling to decide where to begin and whether they are using the right cloud delivery life cycle methods. Vendor-specific offerings often lock enterprises into proprietary or inflexible solutions. There are several pitfalls, and without careful discovery and planning, the potential benefits will not be realized.
This post is the first in a three-part series that discusses the three-step approach used at MomentumSI to migrate applications to the cloud.
The Planning begins with:
- Identifying key business and technical drivers
- Analyzing the characteristics of the legacy asset
- Gathering business and technical requirements
- Approving the business case
Identify the key business and technical drivers
It is important to understand the motivation in terms of the issues being addressed and the potential benefits expected to be gained. For example:
- High operational costs in the current environment
- Expiry of data center lease or increase in lease costs
- Poor quality of service
- Workload spikes causing issues with scalability and performance
- Multiple instances with underutilized capacity
- Significant changes or new functionality require heavy customization
- Inability to quickly implement changes to support evolving business needs
- Complexity of reporting performance metrics and SLAs
Identify all tangible and intangible benefits that should result from the migration. Some examples of benefits are cost elasticity and improved system availability. Translate the drivers into critical success factors and key performance indicators to help measure whether the expected value is being realized.
Assess Legacy Asset(s) in Scope
After the business motivation is clear, it's time to begin the migration analysis. Inventory the assets to be migrated and capture their attributes, i.e., infrastructure, applications and databases.
Identify all the application environments and their attributes, including dev, test, lab, prod, DR and HA. If possible, obtain documentation on the current-state application architecture. Review trouble tickets to reveal existing issues with the application and infrastructure.
Check code documentation for other types of dependencies on specific frameworks, libraries, third-party tools, etc.
Determine the existing volume of data that needs to be migrated.
Determine the users and roles that have to be moved. Identify all application-specific configuration files and scripts, hidden dependencies such as "cron" jobs, and mechanisms for audit, error management and logging.
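As an illustration of this kind of discovery, a trivial and entirely hypothetical Python sketch for a Linux host might look like the following; the paths and the package query are placeholders, and a real inventory effort would go much further.

```python
# Hypothetical discovery sketch: list cron sources, likely application
# configuration files and installed packages on a host being inventoried.
import glob
import subprocess

# System and per-user cron entries ("hidden" scheduled dependencies).
cron_sources = glob.glob("/etc/cron*") + glob.glob("/var/spool/cron/*")
for path in cron_sources:
    print("cron source:", path)

# Application-specific configuration files (placeholder location).
for conf in glob.glob("/opt/legacy-app/conf/*.properties"):
    print("config file:", conf)

# Installed packages give a first cut at framework/library dependencies.
packages = subprocess.run(
    ["dpkg-query", "-W", "-f", "${Package} ${Version}\n"],
    capture_output=True, text=True,
).stdout
print(packages[:500])  # truncated preview
```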
Inspect existing policies, procedures and triggers that may have to be moved. Determine the true cost of running the application in terms of server costs, support and data center costs (power, space, incidentals), using existing financial models or another approach.
Gather Business and Technical Requirements
When analyzing a system, one of the hardest tasks is understanding why something was delivered the way it was. To understand the business and technical requirements, the analysis has to go deeper:
- Application dependencies on other applications, whether running in "business as usual" (BAU) mode or already in the cloud
- Business process dependencies with other domains
- Migration impacts such as in-flight projects and programs that impact the application migration
- Legal, compliance (PCI, SOX) and security requirements
- Technical requirements for the assets in scope, i.e. infrastructure, applications, databases and others
- Integration Technologies/Practices and Application Coupling
- Software Lifecycle Environments and Practices
- Virtualization Technology and Practices
- Disaster Recovery and Backup requirements
- Configuration and Release Management Practices
- Incident and Problem Management (prod and non-prod)
- Existing Data Center Components (focused on compute, network, and storage)
- Costs Related to Operating Current Environment
- Funding/Chargeback Approach
Not all the items listed above need to be captured in depth; focus on those most relevant to the migration. A business case must be prepared and a gate review held to assess the opportunity and provide approval to proceed.
The planning step is critical in ensuring that the required breadth and depth of requirements have been gathered, the key drivers are understood, and the asset attributes have been captured adequately to create and assess the business case. At the end of the planning phase, the business should understand (1) the scope of the migration, (2) a rough timeline, and (3) an initial estimate of the cost to make the move.
Relational databases, the dominant model for data storage and management for the past several decades, are not a good fit for some of the problems faced by today's businesses. The broad success of phenomena such as social networks, social media, mobile usage and the evolution of the cloud has caused an explosion in the volume and variety of data that has to be stored and managed.
NoSQL databases such as MongoDB and Cassandra have risen to the challenge. They allow horizontal scaling on commodity hardware through sharding and MapReduce, support high availability and have the flexibility to handle semi-structured data. 10gen offers a ruggedized version of MongoDB to go along with their support and training offerings. MongoLab offers MongoDB in the cloud on platforms such as AWS.
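To make the "semi-structured data" point concrete, here is a small, hypothetical pymongo example; the connection string, database, collection and field names are all made up for illustration. Documents in one collection do not have to share a schema, which is precisely the flexibility referred to above.

```python
# Hypothetical pymongo sketch: differently shaped documents can live in
# the same collection, which is what "semi-structured" buys in practice.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder URI
events = client.demo_db.events

events.insert_many([
    {"user": "alice", "type": "page_view", "page": "/pricing"},
    {"user": "bob", "type": "purchase", "items": ["sku-1", "sku-2"],
     "total": 49.90},
])

# Query by a field that only some documents have.
for doc in events.find({"type": "purchase"}):
    print(doc["user"], doc.get("total"))
```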
Not only are we using NoSQL to solve the problems associated with large data volumes and semi-structured data for startups and ISVs, but we are using NoSQL to solve old problems such as integration, analytics and content management for enterprises. For some customers, we are moving application environments that use relational databases to use NoSQL databases in order to support new application architectures and agile development.
Beware! Relational databases may still be adequate or appropriate for your specific scenario, in which case migrating to a NoSQL database makes no sense. For those that are considering migration to a NoSQL database, we have created a video and a presentation to help in your decision-making process. The video is complemented by a more detailed presentation. We welcome you to leverage this resource and let us know if you find it helpful. As always, feel free to contact us if you need any assistance with your NoSQL initiatives and/or to help you realize broader cost-efficiencies across your organization by leveraging NoSQL. Enjoy!
If you're Amazon, you have to start thinking about how to make sure it never happens again. Restore confidence... and fast. Here's what they said:
We have made a number of changes to protect the ELB service from this sort of disruption in the future. First, we have modified the access controls on our production ELB state data to prevent inadvertent modification without specific Change Management (CM) approval. Normally, we protect our production service data with non-permissive access control policies that prevent all access to production data. The ELB service had authorized additional access for a small number of developers to allow them to execute operational processes that are currently being automated. This access was incorrectly set to be persistent rather than requiring a per access approval. We have reverted this incorrect configuration and all access to production ELB data will require a per-incident CM approval. This would have prevented the ELB state data from being deleted in this event. This is a protection that we use across all of our services that has prevented this sort of problem in the past, but was not appropriately enabled for this ELB state data. We have also modified our data recovery process to reflect the learning we went through in this event. We are confident that we could recover ELB state data in a similar event significantly faster (if necessary) for any future operational event. We will also incorporate our learning from this event into our service architecture. We believe that we can reprogram our ELB control plane workflows to more thoughtfully reconcile the central service data with the current load balancer state. This would allow the service to recover automatically from logical data loss or corruption without needing manual data restoration.
Here's my question: If ITIL Service Transition (thoughtful change management) and DevOps (agile processes with infrastructure-as-code) were to mate, what would the outcome be?
A) A child that wanted to run fast but couldn't because of too many manual/approval steps
B) A child that ran fast but only after the change board approved it
C) Mate multiple times; some children will run fast (with scissors) others will move carefully
D) No mating required; just fix the architecture (service recovery)
This is the discussion that I'm having with my colleagues. And to be clear, we aren't talking about what Amazon could/should do, we're talking about what WE should do with our own projects.
Although there's no unanimous agreement, there are some common beliefs:
1. Fix the architecture. I like to say that "cloud providers make their architecture highly available so we don't have to." This is an exaggeration, but if the cloud provider does their job right, we have to focus less on making our application components HA and more on correctly using the provider's HA components. There's little disagreement on this topic. AWS screwed up the MTTR on the ELB. We've all screwed up things before... just fix it.
2. Rescind dev-team access. So this is where it gets interesting. Remember all that Kumbaya between developers and operators? Gone. Oh shit - maybe we should have called the movement "DevTestOps"! One simple mistake and you pulled my access to production?? LOL - hell, yeah. The fact is, all services aren't created equal. I have no visibility into Amazon's internal target SLAs, but I'm going to guess that there are a few services that are five-9's (or 5.26 minutes of downtime per year). Certain BUSINESS CRITICAL services shouldn't be working on DevOps time. They should be thoughtfully planned out with Change Advisory Boards, Change Records and Release Windows, by pre-approved Change Roles. Yes - if it's BUSINESS CRITICAL - pull out your ITIL manuals and follow the !*@$ing steps!
Again - there's little disagreement here. People who run highly available architectures know that re-releasing something critical requires special attention to detail. Run the playbook like you're launching a nuclear missile: focus on the details.
To be clear, I love infrastructure-as-code. I think everything can be automated, and it kills me to think about putting manual steps into tasks that we all know should run human-free. If your application is two-9's (3.65 days of downtime per year), automate it! Hell, give the developers access to production data - you can fix it later! What about 99.9% uptime (8.76 hours)? Hmm... not so sure. What about 99.99% uptime (52.56 minutes)? Well, that's not a lot of time to fix things if they go wrong. But wait - if I did DevOps automation correctly, shouldn't I be able to back out quickly? The answer is yes - you SHOULD be able to run your SaveMyAss.py script and it MIGHT work.
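For reference, the downtime figures quoted above fall straight out of the availability arithmetic; a quick sanity check:

```python
# Allowable downtime per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60          # 525,600

for availability in (0.99, 0.999, 0.9999, 0.99999):
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"{availability:.3%} uptime -> {downtime_min:,.2f} minutes/year "
          f"({downtime_min / 60:.2f} hours)")
```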
Dev-to-Test = Use traditional DevOps & IaC (Infrastructure as Code)
Test-to-Stage = (same as above)
Stage-to-Prod (version 1) = (same as above)
Patch-Prod (99% up-time or less) = (same as above)
Patch-Prod (99.9% or greater up-time) = Run your ITIL checklist. Use your IaC scripts if you got'em.
For me, it's not an either/or choice between ITIL Transition Management and DevOps. IMHO, both have a time and a place. That said, I don't think that the answer is to inbreed the two - DevOps will get fat and be the loser in that battle. Keep agile agile. Use structure when you need it.
1. OpenStack continues to gain traction but many early adopters bypass Folsom in anticipation of Grizzly.
2. Amazon's push to the enterprise means we will see more hosted, packaged apps from Microsoft, SAP and other large ISVs. Their IaaS/PaaS introductions will be lackluster compared to previous years.
3. BMC and CA will acquire their way into the cloud.
4. SAP HANA will quickly determine that Teradata isn't its primary competitor as the rise of OSS solutions matures.
5. Data service layers (think Netflix/Cassandra) become common in large cloud deployments.
6. Rackspace, the "Open Cloud Company" continues to gain traction but users find more and more of their services 'not open'.
7. IBM goes another year without a cohesive cloud strategy.
8. Puppet and Chef continue to grow presence but Cfengine gets a resurgence in mindshare.
9. Cloud Bees, Rightscale, Canonical, Inktank, Enstratus, Piston Cloud, PagerDuty, Nebula and Gigaspaces are all acquired.
10. Eucalyptus sunsets native storage solutions and adopts OpenStack solutions.
11. VMware solution dominates over other CloudFoundry vendors.
12. Cloud 'cost control' vendors (Newvem, Cloudyn, Cloud Cruiser, Amysta, Cloudability, Raveld, CloudCheckR, Teevity, etc.) find the space too crowded and begin shifting focus.
13. PaaS solutions begin to look more and more like orchestration solutions, with capabilities to leverage SDN, provisioned IOPS, IAM and autonomic features. Middleware vendors that don't offer open source solutions lose significant market share in the cloud.
14. Microsoft's server-side OS refresh opens the door to more HyperV and private cloud.
15. Microsoft, Amazon and Google pull away from the pack in the public cloud while Dell, HP, AT&T and others grow their footprint but suffer growing pains (aka, outages).
16. Netflix funds and spins out a cloud automation company.
17. Red Hat focuses on the basics, mainly integrating/extending existing product lines with a continued emphasis on OpenStack.
18. Accenture remains largely absent from the cloud, leaving Capgemini and major off-shore companies to take the revenue lead.
19. EMC will continue to thrive: it's even easier to be sloppy with storage usage in the cloud and users realize it isn't 'all commodity hardware'.
20. In 2013, we'll see another talent war. It won't be as bad as dot-com, but talent will be tight.
I try to keep my predictions upbeat and avoid the forecasts on who will meet their demise - but yes, I anticipate a few companies will close doors or do asset sales. It's all part of the journey.
Enjoy your 2013!
- No disrespect to my friends on the HP Cloud team, but I honestly believe that if Netflix had done a sudden switch from AWS to HP, it would have brought HP Cloud to its knees. ELBs (if they had them) would have been crushed and Internet gateways would have been overloaded. Finding a very large number of idle servers might also have been a challenge.
- In this imaginary scenario, I guess we’ll assume that Netflix decided to keep their movie library and all application services running on multiple clouds. Sure this would be expensive but it wouldn’t have been realistic for them to do a just-in-time copy of the data from one location to the other.
- Netflix has done a great job of publishing their technical architecture: EMR, ELB, EIP, VPC, SQS, Autoscale, etc. None of these are available in the solution Dianne prescribed (Stackato), nor does HP Cloud offer them natively. There is a complete mismatch of services between the clouds. CloudFoundry offers some things that are ‘similar’ but I’m concerned that they wouldn’t have offered performance at scale.
- Netflix has also created tools specific to the AWS cloud (Asgard, Astyanax, etc.) as well as tuned off-the-shelf tools for AWS like Cassandra. These would have to be refined to work on each target cloud.
If you try three different calculators, you will get three different ROIs. We did not find a single calculator that provides a truly independent, vendor-neutral analysis of the costs involved in hosting Windows- and Linux-based servers in the public and private clouds over an extended period. So we built our own! And now you can take advantage of this tool to analyze your own situation.
Our calculator can be used to compare the cost of running a private cloud in your own or leased data center, or at a large public cloud provider. If you already have a data center and own hardware, private cloud might be the right choice. However, if you need new gear or more space, public cloud will likely be the less expensive option. Give it a try and see what your costs might look like!
a) Yellow Cells are fields that you have to enter [or use the defaults when given]
b) Blue Cells are calculated values that you can modify, if needed
c) Green Cells represent calculated values
As one might expect, the Windows Report and the Linux Report have the cost comparisons between the private and public clouds – over a 5-year period.
How to Use the Calculator
Assumptions sheet – This captures the assumptions behind the calculations. The information relates to processor and data center costs, and some of your usage patterns. You can change some of the numbers here if you have more precise data.
AWS Costs sheet – For the public cloud portion of the calculator, we have chosen pricing data for AWS, the leader in both market share and cost-competitiveness. The AWS pricing data shown below is from December 2012. Do not modify this sheet, although you may want to review the numbers.
Assumptions sheet – This sheet has the raw data that you must enter to be able to generate the Windows and Linux Reports. Again, enter data only in the Yellow Cells and, if needed, in the Blue Cells. The three areas in which information is gathered are:
a) the base cloud – your data center and hardware requirements
b) OS, IaaS and PaaS requirements
c) Anticipated headcount changes
Windows/Linux Report – These sheets show the result of the calculations: a cost comparison between hosting Windows-based [or Linux-based] servers on the public [AWS] and private clouds. The breakdown of costs over a 5-year period is detailed in separate tables for the two clouds.
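To make the shape of the comparison concrete, here is a deliberately simplified, hypothetical Python version of the kind of calculation the spreadsheet performs; every number below is a placeholder, not a figure from the calculator.

```python
# Hypothetical, simplified 5-year cost comparison in the spirit of the
# calculator. All inputs below are placeholders, not real pricing data.
YEARS = 5
SERVERS = 40

# Private cloud: up-front hardware plus annual data center and staff costs.
private_capex = SERVERS * 6000                       # hardware per server
private_annual = SERVERS * 1200 + 2 * 90000          # power/space + 2 admins
private_total = private_capex + YEARS * private_annual

# Public cloud: hourly instance pricing plus reduced operations headcount.
instance_hour = 0.32                                  # placeholder rate
public_annual = SERVERS * instance_hour * 24 * 365 + 1 * 90000
public_total = YEARS * public_annual

print(f"Private cloud, 5 years: ${private_total:,.0f}")
print(f"Public cloud,  5 years: ${public_total:,.0f}")
```

The real spreadsheet layers on OS licensing, PaaS services and headcount changes, but the basic structure is the same: capital plus recurring costs on one side, purely recurring costs on the other, totaled over five years.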
- MapReduce becomes the AWS Elastic MapReduce service
- Dynamo and eventual consistency become AWS DynamoDB / MongoDB-aaS
- Dremel becomes Google BigQuery