
Welcome to our Coding with Python page! Here you will find various code and articles covering PHP, Python, AI, cybersecurity, and more, as well as electricity, energy, and nuclear power.


Monday, 25 October 2021

Using Data To Make Better Decisions



How many tennis balls could you fit in all of the skyscrapers in New York City? How many gummy bears could you fit in an airplane? How long would it take to fill the Mariana Trench with peanut butter? 

Like most people, you’d probably need some time and maybe a whiteboard to guess these answers. Not only are the questions difficult to imagine and rationalize, but also there’s a lot of relevant background information needed to be able to make an accurate guess.

Without collecting all relevant data, any answer is a guess.

Imagine watching any sports game where you could only see one team and no scoreboard. You would be able to guess how each team is doing from positioning and reactions, but it would be difficult to say confidently which team is winning or who is more likely to win the entire game. It would be even more difficult to win bets against someone else watching normally on TV being spoon-fed a plethora of statistics and hearing the opinions of professional commentators. 

Despite the unfairness of this competition, this is how many people invest in stocks. Dark pools can account for more than half of the overall trade volume in a stock, and yet despite this huge imbalance, many people don't know they exist, let alone monitor their activity. Gaining access to and using all the available information before making critical decisions can level the playing field and facilitate educated decisions.
Data can help us make sense of big numbers and simplify complex ideas. Pluto is roughly 3 billion miles away, so how long would it take to walk there? About a billion hours, which is longer than watching all the content on all major streaming sites back to back 20,000 times. While the number and the analogy mean the same thing, one is substantially easier to understand intuitively than the other. 
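A quick back-of-the-envelope check of that figure, assuming a walking pace of about 3 mph (the pace is our assumption, not part of the original claim):

```python
# Rough arithmetic behind the "walk to Pluto" analogy.
distance_miles = 3e9          # Pluto is roughly 3 billion miles away
walking_speed_mph = 3.0       # assumed steady walking pace

hours = distance_miles / walking_speed_mph
print(f"{hours:.1e} hours")                # 1.0e+09 hours
print(f"{hours / (24 * 365):,.0f} years")  # ~114,155 years
```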

Data explanation and visualization is a crucial component of understanding complex data points.


Financial data is infamously one of the largest data sets in the world and endlessly complicated, so seeing it in easy-to-process graphs instead of raw metrics helps make sense of it. It's difficult to visualize two numbers that are orders of magnitude apart, but it's easy to see how big one circle is against another.
Likewise, it's almost impossible to rationalize numbers without context for what they mean. Financial data is rife with jargon and acronyms that require a dictionary to read, let alone understand. Processing this information means knowing not just what the words mean, but also how each metric affects a company and how it compares to other similar companies.

One of the best ways to use data is not as a standalone item in a complex sheet but as a living, breathing, dynamic guide for making better decisions. Data in isolation is hard to understand intuitively and even worse to try to act upon. Combined with visualizations, it can form a complete picture for faster understanding and superior intuitive answers.

Not all data is created equal.


This paradigm is far from original, with many of the most prominent data sources using citations and best practices to try to eliminate the uncertainty. One of the largest crypto data providers revealed that 65%-95% of all their data was inaccurate and untrustworthy. In the wake of the LIBOR scandal, cracks in the financial system were exposed and the underbelly of market manipulation was revealed. Recently, payment for order flow was popularized, selling people’s trades to big institutions and allowing them to take profit away from investors. Far from novel, these are just several examples of data being difficult to trust. 

The common thread across all three of these is a fundamental agency problem: each of these groups stood to gain financially from manipulating their data or failing to correct incorrect data. The solution is to find data sources that are fundamentally incentivized to provide accurate data. Most brokers provide access to financial data, but often there is a conflict of interest, such as a broker selling their own stock, or trading fees that are either explicit or hidden in slippage.

For this reason, it’s essential to carefully evaluate the trustworthiness of data sources and not take information at face value, especially when critical decisions are being made from it. Healthy skepticism and asking if there is a conflict of interest can expose early on whether a data set is purely analytical or might be inaccurate and skewed.

Data is one of the most valuable resources in the world.


Nearly half of the seven biggest companies in the world use data as their primary product, and for good reason. Making informed, objective decisions reduces uncertainty and greatly improves the odds of the best possible outcome. Learning to make these decisions from comprehensive, well-understood, and trustworthy data can drastically increase the effectiveness of any decision and bring order to an otherwise chaotic world.

Tuesday, 19 October 2021

TOP IOT TRENDS AND PREDICTIONS TO LOOK OUT FOR IN 2022


A glimpse at the top IoT trends and predictions set to unfold in 2022

The Internet of Things (IoT) is a popular topic of discussion among people working in the tech and manufacturing industries. IoT is a much-hyped futuristic technology, known for its social and technological wonders. The latest IoT trends attest to the fact that this technology has revolutionized several industries. It has helped businesses improve processes, boost earnings, and provide better customer service.

IoT and its smart sensors have already paved the way for automation enhancements in different sectors, along with the increasing use of the cloud and the development of 5G. The emerging trends in IoT are majorly driven by artificial intelligence, edge computing, blockchain, and others. This article talks about some of the top trends and predictions of IoT that will change the tech world in 2022.

 

Top IoT Trends and Predictions for 2022

Advanced Security

The IoT market will witness a renewed focus on security. Experts believe that by 2022, IoT network hacking will become a common phenomenon. Network operators will act as cybersecurity personnel and prevent intruders from causing harm. Since cyberattacks have become so common lately, companies like Sierra Wireless, which have been attacked previously, have adopted IoT-driven cyber tools. Other companies, like Ericsson, Microsoft, and U-Blox, have come up with their own threat detection and security service tools.

 

Analytics, Machine Learning, other Disruptive Technologies

IoT analytics and data detection are finding growing application in IoT networks. These systems trigger alerts while transferring large volumes of data to the network core. Integrated analytics are now being deployed into solutions as providers and business leaders want to accelerate data analysis. These systems support the adoption and optimization of IoT devices, processes, applications, and infrastructure, ensuring improved performance as networks operate in a low-latency environment.

 

IoT Transforming Business Models

There are numerous reports and examples of the successful implementation of IoT in business models, where it has enhanced company performance by increasing output and improving other business metrics and objectives. With the help of automation, manufacturers are transforming their entire business models into more innovative and productive ones.

 

AR and VR with IoT

Virtual and augmented reality with IoT can bind together the physical and digital worlds, opening up opportunities to apply IoT data in AR and VR technologies. Implementing IoT in these technologies brings economic benefits such as reduced costs and new profit opportunities. Combining these technologies can also help companies educate employees using virtual prototypes of products and equipment, and help them weigh various strategies for business growth.

 

Adoption of IoT in Healthcare

The healthcare industry has been experimenting with IoT technologies for years now and is leading in IoT adoption and innovation. Equipment such as wearable sensors and devices, tracking, and indoor navigation technology is being developed using IoT. IoT can also enhance lighting systems by linking them to health monitors and deploying sensors.

 

The Emergence of Smart Cities

The establishment of smart cities will be a result of IoT and edge computing technologies. Experts are discussing the next innovations in smart digital connectivity. Some cities in the United States are already contemplating connecting utilities, parking meters, and traffic lights to IoT networks. The IoT market is expected to grow to US$639.74 billion by 2022. Smart cities will not only improve living standards but also benefit citizens economically.

 

Boost in Customer Service

IoT technologies have massively impacted customer service by providing richer information. IoT can extend the power of CRM systems, helping them detect customer problems and report them to companies. These systems even allow companies to join customer discussions for improved consumer engagement, which eventually leads to customer retention. The increasing use of IoT was also witnessed during the pandemic, where it helped enforce social distancing measures.

Wednesday, 13 October 2021

You Can Use Artificial Intelligence to Take Your Presentations to the Next Level


The ability to create killer presentations is an essential skill in the modern workplace. Whether you're making a pitch to potential investors or business partners, presenting at an industry conference, or simply trying to communicate ideas to your coworkers, a compelling presentation can get people to engage and buy in, which will ultimately advance your career. So what is the most important factor in a compelling presentation? Obviously, solid public speaking skills are a must. However, in today's technology-driven world, it is even more important to master the art of visual communication. Research has shown that presentation efficacy is 7-percent content, 38-percent voice, and 55-percent visual. That's why there are so many design tools and deck templates out there. However, if you want a solution that truly takes your presentations to the next level, you need to check out Beautiful.ai.

WHAT MAKES A GOOD SLIDE DECK?

Image via Beautiful.ai

When it comes to visual communication, there are really five basic qualities you want your slide presentation to have:

  • Minimal text that supports rather than repeats what you are saying.
  • Slides with a variety of layouts to avoid monotony and maintain visual interest.
  • Beautiful visuals such as graphs, charts, statistics, and images that support your key takeaways.
  • Meticulous consistency and coherence in style and formatting.
  • On-brand design that incorporates your company or organization's color palette and logos.

Unfortunately, actually creating presentations with these qualities is easier said than done. While a whopping 91-percent of presenters say they feel more confident when they have a well-designed slide deck, about 45-percent find it difficult to design effective layouts, 41-percent find it difficult to find or implement good visuals, 47-percent say they generally spend more than 8 hours on design, 35-percent choose bright vibrant colors instead of sticking with brand colors, and 7-percent say they even struggle to choose a good font.

Clearly, the usual way of creating presentations is not very efficient. Luckily, it doesn’t have to be this way thanks to Beautiful.ai.

MEET BEAUTIFUL.AI

Image via Beautiful.ai

Beautiful.ai is an automated presentation design platform that helps you create captivating slide decks while saving yourself hours of busywork, transforming your ideas into visual stories in just a few minutes.

The key feature of the Beautiful.ai platform is its built-in AI and hundreds of smart slide templates. Its users don't need to spend any time studying effective design layout. The Beautiful.ai AI designer handles all the heavy lifting, helping you autoformat your slides and make the best possible choices for your presentation. Every slide and every template is fully customizable, offering you dozens of possible tweaks to help you stay on brand. And there are millions of free photos and icons at your fingertips. Perhaps most importantly, Beautiful.ai's simple and intuitive menus and controls dramatically minimize the learning curve, so you spend less time learning how to use it and more time fine-tuning your message.

Of course, while the quality of the presentations you will create with Beautiful.ai is fantastic, the best part about this platform might just be the price. Unlike a lot of other professional productivity tools out there, you won’t have to go through a round of venture financing just to afford it. The top tier subscription is probably less than your team spends on coffee in a day. And anybody can try Beautiful.ai for free.

TRY IT FREE FOR YOUR UPCOMING PRESENTATIONS

Image via Beautiful.ai

If you’re ready to dial your presentations up a notch, give yourself the tools that dozens of top companies are already using. Check out Beautiful.ai today.




What Is Managed Workflows for Apache Airflow On AWS And Why Companies Should Migrate To It

Apache Airflow is a popular open-source tool that helps teams create, schedule, and monitor sequences of tasks, known as “workflows.”

In particular, data engineers utilize Airflow to help manage a lot of their data pipelines due to its ability to easily deploy and manage complex tasks in what is often referred to as a DAG (which we will discuss shortly).

Managed Workflows for Apache Airflow, abbreviated as MWAA, is a managed service designed for orchestration in Apache Airflow. With MWAA, users can easily operate data pipelines at scale, setting up and managing them from end to end.

What isn’t always discussed in articles about Airflow is the struggle to manage Airflow systems. They can be challenging to scale, overload VMs and have scheduler get stuck.

When this isn’t managed by a service like MWAA or Astronomer.io, it can pose a lot of extra work.

In this community update we wanted to discuss MWAA and its benefits to see if it’s the right choice for your team.

What Is Apache Airflow?

Apache Airflow is a powerful platform that helps teams manage their workflows programmatically. With Airflow, sequences of tasks are effortlessly turned into Directed Acyclic Graphs (DAGs). Using the Airflow scheduler, users can set up dependencies to delegate tasks automatically to workers, ensuring reliable and fast execution.

Even when working with complex DAGs, changes are made easily thanks to a robust set of command line utilities. With an advanced user interface, visualizing pipelines is simple, allowing teams to monitor production, progress and problems easily. By defining workflows as code, maintaining, versioning, testing, and collaborating with them is made simpler. All of that makes Apache Airflow a fantastic platform for workflow orchestration, and MWAA only makes it better.
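As a concrete illustration, here is a minimal sketch of what such a DAG looks like in Airflow 2.x; the DAG id, task ids, and logic are purely illustrative:

```python
# A minimal Airflow 2.x DAG: two Python tasks with a dependency,
# scheduled daily. Names and logic here are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pulling raw data")


def transform():
    print("cleaning and aggregating")


with DAG(
    dag_id="example_etl",
    start_date=datetime(2021, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task  # run transform only after extract succeeds
```

The `>>` operator declares the dependency, and the scheduler takes care of delegating each task to a worker once its upstream tasks have succeeded.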

What Is Managed Workflows for Apache Airflow?

Managed Workflows gives developers the ability to quickly deploy an Airflow instance on AWS that utilizes a combination of other AWS services to optimize the overall set-up.

Scalability, availability and security are assured with Managed Workflow’s automatic orchestration, which will scale capacity to meet your needs with limited intervention.

This is unlike a more manual approach which would be to use an EC2 instance to run DAGs that you either store in S3 or on the EC2 instance itself.

The managed service means that you no longer need to monitor and manually scale your Celery workers to meet your workflows’ demand.

MWAA is set up to use only the CeleryExecutor and, more importantly, has an autoscaling mechanism implemented under the hood. In my consulting work I have come across plenty of projects where a client had a broken Airflow instance that wouldn't have broken if they had used MWAA.

Combined with integrated AWS security, this will allow teams to enjoy fast data access with complete peace of mind.

Now let’s take a closer look at some benefits that come along with using Managed Workflows for Apache Airflow.

Rapid Airflow Deployment at Scale

Deploying at scale is fast and easy. Choose from AWS Management Console, AWS CloudFormation, AWS SDK or CLI. Once you create an account, you can begin deploying DAGs (Directed Acyclic Graphs) directly to your Airflow environment. There’s no need to wait for someone to provision infrastructure or gather development resources.

Improved Logging

One consulting project I came across had an Airflow scheduler that stopped working. The reason?

Their logs were being stored locally and caused the VM to crash.

One, you shouldn't store your logs locally. Two, why should you even have to worry about where your logs are stored?

MWAA is configured to use CloudWatch for logging. Setting this up manually would be tedious: it would require a developer to configure a CloudWatch Agent on all instances to stream the logs to CloudWatch and to ensure proper log groups for all components. This leaves a lot of room for error.

But what if it were just set up automatically?

Well, that's exactly what MWAA does.
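For illustration, this is roughly what toggling MWAA's managed CloudWatch logging looks like through boto3; the environment name is a placeholder, and which log categories you enable is up to you:

```python
# Sketch: turning on MWAA's managed CloudWatch logging via boto3.
# "my-airflow-env" is a hypothetical environment name.
import boto3

mwaa = boto3.client("mwaa")

mwaa.update_environment(
    Name="my-airflow-env",
    LoggingConfiguration={
        "TaskLogs": {"Enabled": True, "LogLevel": "INFO"},
        "SchedulerLogs": {"Enabled": True, "LogLevel": "WARNING"},
        "WorkerLogs": {"Enabled": True, "LogLevel": "WARNING"},
    },
)
```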

Built-In Airflow Security

Security is a critical concern to modern businesses, but Managed Workflows gives teams peace of mind. With Managed Workflows, you can be confident knowing that your workloads are secure by default. All workloads will run in an isolated and secure environment in the cloud, making use of the Virtual Private Cloud (VPC) Amazon offers. Additionally, all data is encrypted automatically with Amazon’s Key Management Service (KMS).

If you want to control authentication and authorization based on user role, you can do so by tapping into Apache Airflow’s user interface and navigating to the Identity and Access Management (IAM) area. Security is further assured by giving users Single-Sign-On (SSO) access when they need to schedule or view a workflow execution.

Reduced Operational Costs

As a managed service, Managed Workflows helps to cut back on the intensive manual labor that’s typically associated with running Apache Airflow at scale. By cutting back on the heavy lifting, Managed Workflows help bring down engineering overhead and reduce overall operational costs through quicker deployment, less manual input and the on-demand monitoring that’s necessary to orchestrate an end-to-end data pipeline optimally.

Choose The Best Plugin

Managed Workflows gives teams flexibility in that they can choose to use a pre-existing plugin or bring their own. You can always connect to any AWS plugins available, but Managed Workflows also gives you the opportunity to use on-premises resources to run your workflows. Athena, Fargate, Lambda, and Redshift are among the most popular examples, but you'll also find Batch, CloudWatch, Firehose, SNS, and countless others on the list.

How It Works

Once your team is set up with Apache Airflow, making the move to Managed Workflows for Apache Airflow will prove effortless. MWAA uses Python, which is where the Directed Acyclic Graphs (DAGs) are written. It’s those DAGs that help orchestrate and schedule the workflows your team creates.

To set up Managed Workflows, you simply need to assign an S3 bucket to it, which is where the Python dependencies list, DAGs, and plugins will be stored. You can upload to the bucket via a code pipeline or manually in order to describe your ETL and ML processes for automation. Lastly, you can begin running and monitoring your DAGs using the Airflow user interface, CLI, or SDK.
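A hedged sketch of what that provisioning step can look like with boto3; every name, ARN, subnet, and security group below is a placeholder:

```python
# Sketch: provisioning an MWAA environment that reads DAGs,
# requirements, and plugins from an S3 bucket. All names, ARNs,
# subnets, and security groups are placeholders.
import boto3

mwaa = boto3.client("mwaa")

mwaa.create_environment(
    Name="my-airflow-env",
    AirflowVersion="2.0.2",
    SourceBucketArn="arn:aws:s3:::my-dag-bucket",
    DagS3Path="dags",
    RequirementsS3Path="requirements.txt",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/mwaa-execution-role",
    NetworkConfiguration={
        "SubnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"],
        "SecurityGroupIds": ["sg-cccc3333"],
    },
    EnvironmentClass="mw1.small",
    MaxWorkers=5,  # ceiling for the autoscaled Celery workers
)
```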

So Is MWAA For You?

With all of that in mind, the question to ask is simple: is Managed Workflows for Apache Airflow right for your team?

There are some limitations to MWAA compared with an unmanaged Airflow system. For example, at the time of writing, MWAA is limited to Airflow version 2.0.2 while open-source Airflow is on version 2.1.4.

Using any managed service always means giving up some flexibility, but the trade-off is reduced costs: less effort spent planning how to scale out larger systems, and fewer employees needed just to manage operations.

If you're considering adopting Apache Airflow for the first time, or if you're already using Airflow and looking for a way to automate your workflows, MWAA could be a wise addition to your stack.

Thank you for reading!

Wednesday, 6 October 2021

Agile is a mindset. Agile is behaviour.

Ahmed Sidkey's agile mindset image is one I remember fondly from presenting at the Agile Alliance conference in 2014 in Florida, USA. Sidkey depicted a continuum from values to principles and practices.

After seeing it again a few years later when I was presenting at the Agile NZ 2016 conference, and looking at its linear flow, I'm now not so sure.

I've recently talked with some people across a large program who are now focussing on being agile. I think it's amazing that they are considering what it means to be truly agile so early in their agile evolution, rather than turning agile frameworks into yet another mindless, follow-the-proverbial-bouncing-ball process and methodology. This form of doing agile, turning it into yet another corporate process, just doesn't reap any significant benefits in my experience.

An agile mindset is the combination of actions and behaviours that result in an agile culture. It encompasses values, principles, and a disciplined, focused approach to using an agile framework as part of a contemporary way of working. It is a shift from linear, plan-driven ways of working towards an adaptive, value-driven, customer-centric approach.

This mindset is the environment within which agile teams flourish. It isn’t a prerequisite for an agile adoption, nor is it required for a functional agile team. But if this mindset is cultivated and nourished, the teams (and therefore the company) will experience amazing results – happy employees delivering great value and making customers elated with the results.

How do you measure an agile mindset?

Agile IQ® assesses software and non-software teams on both their actions and behaviours.








But as I dug deeper into what people were doing when they said they were being agile, I’ve found the strangest things:

  • Scrum is too hard, so we’ll just say “we’re being agile” and everything will be ok.
  • We’ll reinforce we’re being pragmatic about its use and we “take what works”, even though we have no real experience to base our judgement on “what works”.
  • Being agile is somehow superior to doing agile, so I’m better than you.
  • Agile has too many meetings, so I’ll just say we’re “being agile” and that will excuse us from being in Sprint Planning or going to the Daily Scrum.

Ultimately, most often, what was being heralded as an Agile Mindset was really an excuse not to change the way they worked. They labelled things as “agile” without actually exhibiting any of the expected behaviours.

The strange push of this Agile Mindset phenomenon seemed to be fast becoming the Descartes of agile: "I think I am agile, therefore I am".

Perhaps it was just positive thinking. Maybe if they thought they were agile then they would be agile. 

“The problem with positive thinking is when it disconnects you from reality. If you have achieved your goals in your mind’s eye, studies show you are less likely to consider the concrete actions you need to take and the possible obstacles in the way.”

Tuesday, 5 October 2021

MLOps essentials: four pillars for Machine Learning Operations on AWS

When we approach modern Machine Learning problems in an AWS environment, there is more than traditional data preparation, model training, and final inferences to consider. Also, pure computing power is not the only concern we must deal with in creating an ML solution.

There is a substantial difference between creating and testing a Machine Learning model inside a Jupyter Notebook locally and releasing it on a production infrastructure capable of generating business value. 

The complexity of going live with a Machine Learning workflow in the Cloud is known as the deployment gap, and in this article we will see how to tackle it by combining speed and agility in modeling and training with the solidity, scalability, and resilience required by production environments.

The procedure we'll dive into is similar to what happened with the DevOps model for "traditional" software development. The MLOps paradigm, as it is called, is commonly described as "an end-to-end process to design, create and manage Machine Learning applications in a reproducible, testable and evolutionary way".

In the following paragraphs, we will dive deep into the reasons and principles behind the MLOps paradigm and how it relates to the AWS ecosystem and the best practices of the AWS Well-Architected Framework.

Let’s start!

Why do we need MLOps?

As said before, Machine Learning workloads can be essentially seen as complex pieces of software, so we can still apply "traditional" software practices. Nonetheless, due to its experimental nature, Machine Learning brings to the table some essential differences, which require a lifecycle management paradigm tailored to their needs. 

These differences occur at all the various steps of a workload and contribute significantly to the deployment gap we talked about, so a description is in order:

Code

Managing code in Machine Learning appliances is a complex matter. Let’s see why!

Collaboration on model experiments among data scientists is not as easy as sharing traditional code files: Jupyter Notebooks allow for writing and executing code, resulting in more difficult git chores to keep code synchronized between users, with frequent merge conflicts.

Developers must code on different sub-projects: ETL jobs, model logic, training and validation, inference logic, and Infrastructure-as-Code templates. All of these separate projects must be centrally managed and adequately versioned!

For modern software applications, there are many consolidated Version Control procedures like conventional commits, feature branching, squash and rebase, and continuous integration.

These techniques, however, are not always applicable to Jupyter Notebooks since, as stated before, they are not simple text files.

Development

Data scientists need to try many combinations of datasets, features, modeling techniques, algorithms, and parameter configurations to find the solution which best extracts business value.

The key point is finding ways to track both successful and failed experiments while maintaining reproducibility and code reusability. Pursuing this goal means having instruments that allow for quick rollbacks and efficient monitoring of results, ideally with visual tools.

Testing

Testing a Machine Learning workload is more complex than testing traditional software. 

Datasets require continuous validation. Models developed by data scientists require ongoing quality evaluation, training validation, and performance checks.

All these checks add to the typical unit and integration testing, defining the concept of Continuous Training, which is required to avoid model aging and concept drift.

Unique to Machine Learning workflows, its purpose is to trigger retraining and serve the models automatically.

Deployment

Deployment of Machine Learning models in the Cloud is a challenging task. It typically requires creating various multi-step pipelines which serve to retrain and deploy the models automatically. 

This approach adds complexity to the solution and requires automating steps done manually by data scientists when training and validating new models in a project's experimental phase. 

It is crucial to create efficient retrain procedures!

Monitoring in Production

Machine Learning models are prone to decay much faster than "traditional" software. They can suffer reduced performance due to suboptimal coding, incorrect hardware choices in the training and inference phases, and evolving data sets.

A proper methodology must take this degradation into account; therefore, we need a tracking mechanism to summarize workload statistics, monitor performance, and send alarm notifications.

All of these procedures must be automated and are called Continuous Monitoring, which also has the added benefit of enabling Continuous Training by measuring meaningful thresholds.

We also want to apply rollbacks as quickly as possible when a model inference deviates from selected scoring thresholds, so we can try new feature combinations.

Continuous Integration and Continuous Deployment

Machine Learning shares similar approaches to standard CI/CD pipelines of modern software applications: source control, unit testing, integration testing, continuous delivery of packages. 

Nonetheless, models and data sets require particular interventions.

Continuous integration now also requires, as said before, testing and validating data, data schemas, and models.

In this context, continuous delivery must be designed as an ML training pipeline capable of automatically deploying the inference as a reachable service.

As you can see, there is much on the table that makes structuring a Machine Learning project a very complex task. 

Before introducing the reader to the MLOps methodology, which puts all these crucial aspects under its umbrella, we will see how a typical Machine Learning workflow is structured, taking into account what we have said so far.

Let’s go on!

A typical Machine Learning workflow in the Cloud

Unlike traditional software, a Machine Learning workflow is not meant to be linear. It is mainly composed of three distinct layers: data, model, and code, each continuously giving and receiving feedback from the others.

So while with traditional software we can say that each step composing a workflow can be atomic and somehow isolated, in Machine Learning this is not entirely true, as the layers are deeply intertwined.

A typical example is when changes to the data set require retraining or re-thinking a model. A different model also usually needs modifications to the code that runs it.

Let’s see together what every Layer is composed of and how it works.

The Data layer

The Data layer comprises all the tasks needed to manipulate data and make it available for model design and training: data ingestion, data inspection, cleaning, and finally, data preprocessing.

Data for real-world problems can run to gigabytes or even terabytes, continuously increasing, so we need proper storage for handling massive data lakes.

The storage must be robust, allow efficient parallel processing, and integrate easily with tools for ETL jobs.

This layer is the most crucial, representing 80% of the work done in a Machine Learning workflow; two famous quotes state this fact: "garbage in, garbage out" and "your model is only as good as your data."

Most of these concepts are the prerogative of a Data Analytics practice, deeply entangled with Machine Learning, and we will analyze them in detail later on in this article.

The Model layer

The Model layer contains all the operations to design, experiment, train, and validate one or more Machine Learning models. ML practitioners conduct trials on data in this layer, try algorithms on different hardware solutions, and do hyperparameter tuning.

This layer is typically subject to frequent changes due to updates on both Data and Code, necessary to avoid concept drift. To properly handle its lifecycle management at scale, we must define automatic procedures for retraining and validation.

The Model layer is also the stage where discussions occur between data scientists and stakeholders about model validation, conceptual soundness, and biases in expected results.

The Code layer

In the Code layer, we define a set of procedures to put a model in production, manage inference requests, store a model's metadata, analyze overall performance, monitor the workflow (debugging, logging, auditing), and orchestrate CI/CD/CT/CM automation.

A good Code layer allows for a continuous feedback model, where the model evolves in time, taking into account the results of ongoing inferences.

All three layers are managed by "sub-pipelines," which build on each other to form a "macro-pipeline" known as the Machine Learning Pipeline.

Automatically designing, building, and running this Pipeline while reducing the deployment gap in the process is the core of the MLOps paradigm. 

MLOps on AWS: the four pillars

MLOps aims to make developing and maintaining Machine Learning workflows seamless and efficient. The data science community generally agrees that it is not a single technical solution, but rather a series of best practices and guiding principles around Machine Learning.

An MLOps approach involves operations, techniques, and tools, which we can group into four main pillars: Collaboration, Reproducibility, Continuity, and Monitoring.

We will now focus on each one, giving multiple practical examples that show how AWS, with many of its services, can be an invaluable tool to develop solutions that adhere to the paradigm’s best practices.

Collaboration

A good Machine Learning workflow should be collaborative, and collaboration occurs on all the ML pipelines.

Starting from the Data layer, we need a shared infrastructure, which means a distributed data lake. AWS offers several different storage solutions for this purpose, like Amazon Redshift, which is best for Data Warehousing, or Amazon FSx for Lustre, perfect as a distributed file system. Still, the most common service used for data lake creation is Amazon S3.

To properly maintain a data lake, we need to regularly ingest data from different sources and manage shared access between collaborators, ensuring data is always up-to-date.

This is not an easy task, and for that, we can take advantage of S3 LakeFormation, a managed service that helps in creating and maintaining a data lake, by working as a wrapper around AWS Glue and Glue Studio, in particular simplifying Glue’s Crawler set-up and maintenance.

S3 LakeFormation can also take care of data and collaborators' permission rules by managing users and roles underneath AWS Glue Catalog. This feature is crucial as collaboration also means maintaining governance over the data lake, avoiding unintended data manipulation by allowing or denying access to specific resources inside a catalog.

For the model layer, data scientists need a tool for collaborative design and coding of Machine Learning models. It must allow multiple users to work on the same experiment, quickly show the results of each collaborator, grant real-time pair programming, and avoid code regressions and merge conflicts as much as possible.

SageMaker is the all-in-one framework of choice for doing Machine Learning on AWS, and Amazon SageMaker Studio is a unique IDE explicitly developed for working with Jupyter Notebooks with collaboration in mind.

SageMaker Studio allows sharing a dedicated EC2 instance between different registered users, in which it is possible to save all the experiments done while developing a Machine Learning model. This instance can host Jupyter Notebooks directly or receive results, attachments, and graphics via API from other Notebook instances. 

SageMaker Studio is also directly integrated with SageMaker Experiments and SageMaker Feature Store.

The former is a set of APIs that allows data scientists to record and archive a model trial, from tuning to validation, and report the results in the IDE console. The latter is a purpose-built managed store for sharing up-to-date parameters across different model trials.

SageMaker Feature Store represents a considerable step forward in maintaining governance over data parameters across different teams, mainly because it avoids the common anti-pattern of keeping different sets of parameters for training and inference. It is also a perfect solution to ensure that every data scientist working on a project has complete labeling visibility.
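As a rough sketch of the idea, this is how a feature group can be registered and populated with the SageMaker Python SDK; the bucket, role, and feature names are placeholders, and in practice you would wait for creation to finish before ingesting:

```python
# Sketch: registering a shared feature group so training and
# inference read the same versioned features. Bucket, role, and
# feature names are placeholders.
import pandas as pd
from sagemaker.session import Session
from sagemaker.feature_store.feature_group import FeatureGroup

session = Session()

df = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "avg_order_value": [42.0, 17.5],
    "event_time": [1634000000.0, 1634000000.0],  # unix epoch seconds
})

fg = FeatureGroup(name="customer-features", sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)  # infer the schema from the frame
fg.create(
    s3_uri="s3://my-feature-bucket/offline-store",
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::123456789012:role/sagemaker-execution-role",
    enable_online_store=True,  # serve the same features at inference time
)
# Once the group is active, push rows into it:
fg.ingest(data_frame=df, max_workers=1, wait=True)
```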

Reproducibility

To be robust, fault-tolerant, and scale properly, just like "traditional" software applications, a Machine Learning workflow must be reproducible.

One crucial point we must address with care, as we said before, is Version Control: we must ensure code, data, model metadata, and features are appropriately versioned. 

For Jupyter Notebooks, Git or AWS CodeCommit are natural choices, but managing the information of different trials, especially model metadata, requires some considerations.

We can use SageMaker Feature Store for metadata and features. It allows us to store data directly online in a managed store or integrate with AWS Glue (and S3 LakeFormation). It also enables data encryption using AWS KMS and can be controlled via API or inside SageMaker Studio.

Wanting a workflow to be reproducible also means experimenting on a larger scale, even in parallel, in a quick, predictable, and automatic way.

SageMaker offers different ways to mix and match different Machine Learning algorithms, and AWS allows for three possible approaches for executing a model.

Managed Algorithm: SageMaker offers up to 13 managed algorithms for common ML scenarios, and for each one, detailed documentation describes software and hardware specifications.

Bring your own algorithm: data scientists can quickly introduce custom logic in notebooks, as long as the model respects SageMaker's fit() requirements.

Bring your own Container: particular models such as DBScan require custom kernels to run the algorithm, so SageMaker allows registering a custom container with the special kernel and the code for running the model.

Data Scientists can tackle all these approaches together. 

SageMaker lets you define the hardware on which to run model training or validation by selecting the Instance Type and Instance Size in the model properties. This is extremely important, as different algorithms require CPU- or GPU-optimized machines.
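A minimal sketch of pinning that hardware with the SageMaker Python SDK; the role ARN and S3 paths are placeholders, and built-in XGBoost is only an example algorithm:

```python
# Sketch: choosing the training hardware via instance_type and
# instance_count. Role and S3 paths are placeholders.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image_uri = sagemaker.image_uris.retrieve(
    "xgboost", session.boto_region_name, version="1.2-1"
)

estimator = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::123456789012:role/sagemaker-execution-role",
    instance_count=1,
    instance_type="ml.m5.xlarge",  # e.g. ml.p3.2xlarge for GPU-bound algorithms
    output_path="s3://my-ml-bucket/models",
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)
estimator.fit({"train": "s3://my-ml-bucket/train"})
```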

To fine-tune a model, SageMaker can run two different Hyperparameter Tuning Strategies: Random Search and Bayesian Search. Both strategies are entirely automatic, granting a way to test a greater number of trial combinations in a fraction of the time.
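Continuing from the estimator sketch above, a Bayesian search over two hyperparameters might look like this; the metric name and ranges are illustrative:

```python
# Sketch: Bayesian hyperparameter search over the estimator built
# in the previous snippet. Metric and ranges are illustrative.
from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

tuner = HyperparameterTuner(
    estimator=estimator,  # from the previous sketch
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    strategy="Bayesian",  # or "Random"
    max_jobs=20,
    max_parallel_jobs=4,
)
tuner.fit({
    "train": "s3://my-ml-bucket/train",
    "validation": "s3://my-ml-bucket/validation",
})
```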

To enhance the repeatability of experiments, we also need to manage different ways of doing data preprocessing (different data sets applied to the same model). For this, we have AWS Data Wrangler, which contains over 300 built-in data transformations to quickly normalize, transform, and combine features without having to write any code.

AWS Data Wrangler can be a good choice when the ML problem you’re addressing is somehow standardized, but for most cases, the datasets are extremely diverse, which means tackling ETL jobs on your own. 

For custom ETL jobs, AWS Glue is still the way to go, as it also allows saving Job Crawlers and Glue Catalogs (for repeatability). Along with AWS Glue and AWS Glue Studio, we have also tried AWS Glue Elastic Views, a new service that helps manage different data sources together.

Continuity

To make our Machine Learning workflow continuous, we must use pipeline automation as much as possible to manage its entire lifecycle.

We can break the entire ML workflow into three significant pipelines, one for each Machine Learning Layer.

Data engineering pipeline

The Data pipeline is composed of Ingestion, Exploration, Validation, Cleaning, and Splitting phases.

The Ingestion phase on AWS typically means bringing raw data to S3, using any available tool and technology: direct API access, custom Lambda crawlers, S3 LakeFormation, or Amazon Kinesis Firehose.

Then we have a preprocessing ETL phase, which is always required.

AWS Glue is the most versatile among all the available tools for ETL, as it allows reading and aggregating information from all the previous services by using Glue Crawlers: routines that can poll different data sources for new data.
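A small sketch of such a crawler created through boto3; the names, role, and S3 path are placeholders:

```python
# Sketch: a Glue crawler that re-scans an S3 prefix hourly and
# registers what it finds in the Glue Data Catalog.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="raw-data-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="ml_raw",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/"}]},
    Schedule="cron(0 * * * ? *)",  # poll for new data every hour
)
glue.start_crawler(Name="raw-data-crawler")
```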

We can manage the Exploration, Validation, and Cleaning steps by creating custom scripts in a language of choice (e.g., Python) or using Jupyter Notebooks, both orchestrated via AWS Step Functions.

AWS Data Wrangler represents another viable solution, as it can automatically take care of all the steps and connect directly to Amazon SageMaker Pipelines.

Model pipeline

The Model pipeline consists of Training, Evaluation, Testing, and Packaging phases.

These phases can be managed directly from Jupyter Notebook files and integrated into a pipeline using the AWS Step Functions SageMaker SDK, which allows calling SageMaker functions inside a Step Functions script.

This gives extreme flexibility, as it allows you to:

  1. Quickly start SageMaker training jobs with all the configured parameters.
  2. Evaluate models using SageMaker pre-built evaluation scores.
  3. Run multiple automated tests directly from code.
  4. Record all the steps in SageMaker Experiments.

Having the logic of this Pipeline on Jupyter Notebooks has the added benefit of having everything versioned and easily testable.
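To make the orchestration concrete, here is a hedged sketch of a one-step training pipeline; rather than the Step Functions SageMaker SDK mentioned above, it uses plain boto3 with an Amazon States Language definition, and all ARNs, image URIs, and paths are placeholders:

```python
# Sketch: a one-state Step Functions machine that runs a SageMaker
# training job synchronously.
import json

import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "StartAt": "TrainModel",
    "States": {
        "TrainModel": {
            "Type": "Task",
            # ".sync" makes the state wait for the training job to finish
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            "Parameters": {
                "TrainingJobName.$": "$.job_name",
                "AlgorithmSpecification": {
                    "TrainingImage": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-algo:latest",
                    "TrainingInputMode": "File",
                },
                "RoleArn": "arn:aws:iam::123456789012:role/sagemaker-execution-role",
                "OutputDataConfig": {"S3OutputPath": "s3://my-ml-bucket/models"},
                "ResourceConfig": {
                    "InstanceCount": 1,
                    "InstanceType": "ml.m5.xlarge",
                    "VolumeSizeInGB": 30,
                },
                "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
            },
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="model-training-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-role",
)
```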

Packaging can be managed through Elastic Container Registry APIs, directly from a Jupyter Notebook or an external script. 

Deployment pipeline

The Deployment Pipeline runs the CI/CD part and is responsible for taking models online during the Training, Testing, and Production phases. A key aspect of this Pipeline is that the demand for computational resources differs across all three stages and changes over time.

For example, training will require more resources than testing and production at first, but later on, as the demand for inferences will grow, production requirements will be higher (Dynamic Deployment).

We can apply Advanced deployment strategies typical of "traditional" software development to tackle ML workflows, including A/B testing, canary deployments, and blue/green deployments.

Every aspect of deployment can benefit from Infrastructure as Code techniques and a combination of AWS services like AWS CodePipeline, CloudFormation, and AWS StepFunctions.

Monitoring

Finally, good Machine Learning workflows must be monitorable, and monitoring occurs at various stages.

We have performance monitoring, which lets us understand how a model behaves over time. With continuous feedback based on new inferences, we can avoid model aging (overfitting) and concept drift.

SageMaker Model Monitor helps during this phase as it can do real-time monitoring, detecting biases and divergences using Anomaly Detection techniques, and sending alerts to apply immediate remediation. 
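A sketch of scheduling such monitoring with the SageMaker Python SDK, assuming data capture is already enabled on the endpoint and baseline statistics and constraints were produced earlier; all names and URIs are placeholders:

```python
# Sketch: an hourly data-quality monitoring schedule against a
# live endpoint.
from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor

monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/sagemaker-execution-role",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

monitor.create_monitoring_schedule(
    monitor_schedule_name="churn-model-hourly-monitor",
    endpoint_input="churn-model-endpoint",  # endpoint with data capture on
    output_s3_uri="s3://my-ml-bucket/monitoring",
    statistics="s3://my-ml-bucket/baseline/statistics.json",
    constraints="s3://my-ml-bucket/baseline/constraints.json",
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```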

When a model starts performing below the predefined threshold, our pipeline will begin a retraining process with an augmented data set, consisting of new information from predictions, different hyperparameter combinations, or re-labeled data set features.

SageMaker Clarify is another service that we can exploit in the monitoring process. It detects potential bias during data preparation, model training, and production for selected critical features in the data set. 

For example, it can check for bias related to age in the initial dataset or in a trained model, and generate detailed reports that quantify different types of possible bias. SageMaker Clarify also includes feature importance graphs for explaining model predictions.

Debugging a Machine Learning model, as we can see, is a long, complex, and costly process! Here another useful AWS service comes in: SageMaker Debugger. It captures training metrics in real time, such as data loss during regression, and sends alerts when anomalies are detected.

SageMaker Debugger is great for immediately rectifying wrong model predictions.

Logging on AWS can be managed across the entire Pipeline using Amazon CloudWatch, which is available with all the services presented. CloudWatch can be further enhanced using Kibana through Elasticsearch for an easy way to explore log data.

We can also use CloudWatch to trigger automatic rollback procedures in case of alarms on some key metrics. Rollback is also triggered by failed deployments.
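For example, here is a hedged sketch of such an alarm on a custom model-quality metric; the namespace, metric, threshold, and SNS topic are all placeholders:

```python
# Sketch: alarm on a custom model-quality metric; AlarmActions could
# point at an SNS topic that fans out to a rollback or retraining
# Lambda.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="model-accuracy-degraded",
    Namespace="MLOps/Models",       # custom namespace our pipeline publishes to
    MetricName="rolling_accuracy",
    Statistic="Average",
    Period=3600,
    EvaluationPeriods=3,            # three bad hours in a row before alarming
    Threshold=0.85,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ml-alerts"],
)
```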

Finally, the reproducibility, continuity, and monitoring of an ML workload enable the cost/performance fine-tuning process, which happens cyclically across the whole workload lifecycle.

Sum Up

In this article, we’ve dived into the characteristics of the MLOps paradigm, showing how it took concepts and practices from its DevOps counterpart to allow Machine Learning to scale up to real-world problems and solve the so-called deployment gap.

We've shown that, while traditional software workloads have more linear lifecycles, Machine Learning problems are based on three macro-areas: Data, Model, and Code, which are deeply interconnected and provide continuous feedback to each other.

We've seen how to tackle these particular workflows and how MLOps can manage some unique aspects, like the complexities of managing a model's code in Jupyter Notebooks, exploring datasets efficiently with correct ETL jobs, and providing fast and flexible feedback loops based on production metrics.

Models are the second most crucial thing after data. We’ve learned some strategies to avoid concept drift and model aging in time, such as Continuous Training, which requires a proper monitoring solution to provide quality metrics over inferences and an adequate pipeline to invoke new model analysis.

AWS provides some managed services to help with model training and pipelines in general, like SageMaker AutoPilot and SageMaker Pipelines.

We have also seen that AWS allows for multiple ways of creating and deploying models for inference, such as using pre-constructed models or bringing your own container with custom code and algorithms. All images are saved and retrieved from Elastic Container Registry.

We’ve talked about how collaboration is critical due to the experimental nature of Machine Learning problems and how AWS helps by providing an all-in-one managed IDE called SageMaker Studio.

We have features like SageMaker Experiments for managing multiple experiments, SageMaker Feature Store for efficiently collecting and transforming data labels, and SageMaker Model Monitor and SageMaker Debugger for checking model correctness and finding any bugs.

We've also discussed techniques to make our Machine Learning infrastructure solid, repeatable, flexible, and easy to scale on demand as requirements evolve over time.

Such methods involve using AWS Cloudformation templates to take advantage of Infrastructure as Code for repeatability, AWS Step Functions for structuring a state-machine to manage all the macro-areas, and tools like AWS CodeBuild, CodeDeploy, and CodePipeline to design proper CI/CD flows. 

We hope you’ve enjoyed your time reading this article and hopefully learned a few tricks to manage your Machine Learning workflows better.

As said before, if Machine Learning is your thing, we again encourage you to have a look at our articles here on Proud2beCloud, with use cases and analysis of what AWS offers to tackle ML problems!
As always, feel free to comment in the section below, and reach us for any doubt, question or idea! See you on #Proud2beCloud in a couple of weeks for another exciting story!







