Project Overview
- Project Name: Argus / Dashboard
- Purpose: To assess whether Argus is fit for purpose as a dashboard for our network issue and alert system.
Current State Assessment
System A (As-Is)
- Overview: Existing Internal Dashboard (Java Application)
- Strengths:
- Aggregates alarms from Infinera, Juniper, IMS etc
- Familiar to the consumers of the service
- Tried and tested
- Has been through years of development, feature requests and refinements
- Complex filters
- Blacklisting
- Different viewing modes (Large screen mode, pc mode)
- Alerts can have different states
- Has an API
- Retains history of alarms for reporting purposes
- We own and drive the source code
- Weaknesses:
- Hard to update due to being a legacy Java code base
- Potential security issues due to being a legacy codebase
- Doesn’t integrate well with modern tooling that we have developed over the years
- Isn’t as visually polished as modern HTML5-era GUIs
- Complex build pipeline
System B (As-Is)
- Overview: Argus
- Strengths:
- Looks great: a modern UI with an HTML frontend that is easy to develop and extend, and for which help is easy to find online
- Built on a solid, well-known technology stack (Django backend)
- Potential to use partially managed development freeing up internal resources
- Developed within the family of networks
- Source code is available
- Potential for different screens based on user permissions
- Easier to extend functionality (e.g. short-lived alarms could be integrated?)
- Plug-in Architecture
- Fun to work with
- Weaknesses:
- Ontological alignment: our “Alarms” and Argus “Incidents” are slightly different, so we need to explore the consequences of this (e.g. we want “Alarms” to be displayed that don’t have incident tickets in the ticketing system)
- Requires in-house Django knowledge, or even dedicated ‘Django devs’
- No Alarm states (our “phases”)
- No history
- Only one line of acknowledgment (We have first and second line requirements)
- No ability to drill down
- No blacklists
- Filtering not as comprehensive
- Won’t naturally coalesce or correlate (integrations required)
- Flapping not handled (for future)
- Prioritisation not handled
Desired Future State
- Overview: We realise a shared tool could be used between NRENs for Alert aggregation that follows ITIL practices and standards.
- Argus has put itself forward as a viable candidate for an Alert aggregation tool by making their code open source and sharing its usage and availability at networking conferences.
- We don’t want a fork of Argus, and strongly prefer a unified system that can accommodate an extended use case. Our rough impression is that the UI “skin” would be a relatively straightforward part to develop, but that the fundamental Argus backend use case differs. The main discussion point will be to decide whether it is feasible to have a common backend and/or pluggable architecture that can accommodate both applications.
Argus meets some of our requirements but misses others. To be a complete replacement for our existing tools, it must also achieve at least the following:
- Simple development of new features at the request of users (co-invention with the NOC and SOC teams)
- Deep diving: tools for drilling down into Alerts, and for coalescing and correlating issues for remediation
- Simple integration with our existing workflow (APIs, etc.)
- History, which is vital for reporting on availability and utilisation of services and is a major requirement of our stakeholders
Gap Analysis
Functional Gaps
Feature/Functionality: Alarm States
- Current State (System A): Alarm states are complex and can also be coloured or flashing. The first and second line support teams are trained to recognise these at a glance.
- Current State (System B): Alarms appear to have only one fixed state.
- Desired Future State: Alarms can have multiple states, for example flashing when new, or different colours to show whether they are pending or urgent.
- Gap: Argus cannot currently support multiple alarm states (e.g. flashing for new, different colours for pending/urgent), which is a functional gap compared to the desired future state. Without these features the support teams cannot quickly distinguish between alarm conditions and respond to them, which could delay the handling of critical issues and affect both operations and UX.
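To make the requirement concrete, the following is a minimal sketch of how alarm phases could map to a display style. It is not based on the Argus code base: the names `AlarmPhase`, `DisplayStyle` and `style_for`, the set of phases and the colours are all hypothetical placeholders, and the real phases would come from the existing dashboard.

```python
from dataclasses import dataclass
from enum import Enum


class AlarmPhase(Enum):
    # Hypothetical phases; the real set would come from the existing dashboard.
    NEW = "new"
    PENDING = "pending"
    URGENT = "urgent"


@dataclass(frozen=True)
class DisplayStyle:
    colour: str
    flashing: bool


# Assumed mapping from phase to presentation; the colours are placeholders.
PHASE_STYLES = {
    AlarmPhase.NEW: DisplayStyle(colour="red", flashing=True),
    AlarmPhase.PENDING: DisplayStyle(colour="amber", flashing=False),
    AlarmPhase.URGENT: DisplayStyle(colour="red", flashing=False),
}


def style_for(phase: AlarmPhase) -> DisplayStyle:
    """Return how an alarm in the given phase should be rendered."""
    return PHASE_STYLES[phase]
```

Keeping the mapping in one table means the UI “skin” could change colours or flashing behaviour without touching the alarm data itself.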
Feature/Functionality: Multiple stages of Acknowledgment
- Current State (System A): The dashboard currently requires both first and second line support teams to acknowledge that Alerts have been recognised.
- Current State (System B): Alarms have only one level of acknowledgement (‘Acked’ is a tag or status).
- Desired Future State: Alarms have two levels of acknowledgement.
- Gap: Argus cannot currently track acknowledgement by more than one team, which is a functional gap. Second line support should not waste time on issues that first line has already investigated, and they need to know whether first line has addressed a given issue within the time set by the SLA.
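As an illustration of the two-level requirement, here is a small sketch of an acknowledgement record with an SLA check for first line. All names (`AlarmAcks`, `ack_first_line`, `first_line_sla_breached`) are hypothetical, and the rule that second line can only acknowledge after first line is our assumption for the sketch, not an agreed requirement.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AlarmAcks:
    raised_at: datetime
    first_line_acked_at: Optional[datetime] = None
    second_line_acked_at: Optional[datetime] = None

    def ack_first_line(self, when: datetime) -> None:
        self.first_line_acked_at = when

    def ack_second_line(self, when: datetime) -> None:
        # Assumption: second line only acknowledges after first line has.
        if self.first_line_acked_at is None:
            raise ValueError("first line has not acknowledged this alarm yet")
        self.second_line_acked_at = when

    def first_line_sla_breached(self, now: datetime, sla: timedelta) -> bool:
        """True if first line did not acknowledge within the SLA window."""
        deadline = self.raised_at + sla
        if self.first_line_acked_at is None:
            return now > deadline
        return self.first_line_acked_at > deadline
```

A single ‘Acked’ tag cannot express this, because the two timestamps are needed separately to answer the SLA question.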
Feature/Functionality: Correlation and Coalescing
- Current State (System A): Supports coalescing and correlating issues for remediation.
- Current State (System B): Lacks the ability to coalesce and correlate alerts effectively.
- Desired Future State: A system capable of intelligently correlating and coalescing related alerts to provide a more consolidated and actionable view.
- Gap: The absence of advanced correlation and coalescing capabilities in Argus may lead to an increased workload for support teams and hinder the efficiency of issue resolution.
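One simple form of coalescing is to group alerts that share a source and arrive close together in time. The sketch below shows that idea only; it is not how the existing dashboard or Argus works, and the `Alert` fields, the `coalesce` function and the time-window rule are assumptions made for illustration. Real correlation would likely also use inventory and topology data.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Alert:
    source: str          # e.g. a router or circuit name (hypothetical field)
    description: str
    raised_at: datetime


@dataclass
class AlertGroup:
    source: str
    alerts: List[Alert] = field(default_factory=list)


def coalesce(alerts: List[Alert], window: timedelta) -> List[AlertGroup]:
    """Group alerts that share a source and arrive within `window`
    of the previous alert in the same group."""
    groups: List[AlertGroup] = []
    latest: Dict[str, AlertGroup] = {}  # source -> most recent group
    for alert in sorted(alerts, key=lambda a: a.raised_at):
        group = latest.get(alert.source)
        if group is not None and alert.raised_at - group.alerts[-1].raised_at <= window:
            group.alerts.append(alert)
        else:
            group = AlertGroup(source=alert.source, alerts=[alert])
            groups.append(group)
            latest[alert.source] = group
    return groups
```

With a five-minute window, two alerts from the same router one minute apart would appear as one group, while an alert from that router half an hour later would start a new group.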
Feature/Functionality: Status of live systems
- Current State (System A): Shows the status of services (traffic lights)
- Current State (System B): Lacks the ability to show the status of live services
- Desired Future State: The system should show the status of the collector, classifier, correlation and inventory provider components
- Gap: Without a component showing the status of our live services, the teams cannot be confident that the Alarms are up to date and complete.
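A traffic-light view of this kind could be derived from the time each component last reported in. The sketch below assumes a heartbeat mechanism that we have not specified anywhere; the function names and the warning and failure thresholds are hypothetical, and only the component names come from the desired future state above.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

# Component names taken from the desired future state above.
COMPONENTS = ("collector", "classifier", "correlation", "inventory provider")


def traffic_light(last_seen: Optional[datetime], now: datetime,
                  warn_after: timedelta, fail_after: timedelta) -> str:
    """Map the age of the last heartbeat to green, amber or red."""
    if last_seen is None:
        return "red"  # never heard from this component
    age = now - last_seen
    if age >= fail_after:
        return "red"
    if age >= warn_after:
        return "amber"
    return "green"


def service_status(heartbeats: Dict[str, datetime], now: datetime,
                   warn_after: timedelta, fail_after: timedelta) -> Dict[str, str]:
    """Return a traffic light for every expected component."""
    return {
        name: traffic_light(heartbeats.get(name), now, warn_after, fail_after)
        for name in COMPONENTS
    }
```

A component that has never reported is shown as red rather than omitted, so a missing collector cannot be mistaken for a quiet network.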
Feature/Functionality: Priority
- Current State (System A): Alerts can be prioritised by a number
- Current State (System B): Lacks the ability to prioritise. It only has severity, which is a different concept
- Desired Future State: The system should allow the support team to set a priority order on the tickets
- Gap: The absence of prioritisation makes it difficult for the NOC and the different lines of support to know which Alert should take precedence
Technical Gaps
Integration Points: Modern Technology Stack
- Current State (System A): Uses a legacy Java code base
- Current State (System B): Built on a modern web stack
- Desired Future State: A technology stack that aligns with modern development practices and tooling
- Gap: The current dashboard’s reliance on a legacy Java code base presents a technical gap in comparison to the desired future state. This may result in challenges related to updates, security and integration with modern tooling.
Integration Points: API Flexibility
- Current State (System A): Provides a well-established API for integration with other tools.
- Current State (System B): Has limited API flexibility.
- Desired Future State: A flexible API architecture that allows seamless integration with various tools and workflows.
- Gap: The current limitations in Argus' API flexibility might pose challenges in integrating it with existing and future tools within the network infrastructure.
Data Gaps
Data Flow: History of Alarms
- Current State (System A): Retains alarm history
- Current State (System B): Does not retain alarm history
- Desired Future State: We would like Argus to have an additional table that stores the history of Alarms for reporting.
- Gap: The current system does not keep alarms in history. We would need all alarms to be stored indefinitely. This is not just useful but a requirement for reporting.
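The kind of history table we have in mind can be sketched as an append-only log of alarm state changes. The example uses Python’s built-in `sqlite3` module purely to stay self-contained; in Argus this would more likely be a Django model, and the table name `alarm_history`, its columns and the helper functions are hypothetical.

```python
import sqlite3


def create_history_table(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) append-only alarm history table."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS alarm_history (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            alarm_id TEXT NOT NULL,
            state TEXT NOT NULL,
            recorded_at TEXT NOT NULL  -- ISO 8601 timestamp, UTC assumed
        )
        """
    )


def record_state(conn: sqlite3.Connection, alarm_id: str,
                 state: str, recorded_at: str) -> None:
    """Append a row; existing rows are never updated or deleted."""
    conn.execute(
        "INSERT INTO alarm_history (alarm_id, state, recorded_at) VALUES (?, ?, ?)",
        (alarm_id, state, recorded_at),
    )


def alarms_between(conn: sqlite3.Connection, start: str, end: str) -> int:
    """Count distinct alarms with any recorded state in [start, end)."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT alarm_id) FROM alarm_history "
        "WHERE recorded_at >= ? AND recorded_at < ?",
        (start, end),
    ).fetchone()
    return row[0]
```

Because rows are only ever appended, reporting queries such as `alarms_between` can be run over any past period without depending on what is currently shown on the dashboard.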
Data Flow: Real-Time Monitoring
- Current State (System A): Supports real-time monitoring and updates.
- Current State (System B): Lacks real-time monitoring capabilities.
- Desired Future State: Real-time monitoring features to ensure timely detection and response to critical network events.
- Gap: The absence of real-time monitoring in Argus may result in delays in identifying and addressing urgent issues, impacting overall network responsiveness.
Recommendations
- To bridge the gaps we would need to either fork the Argus project or engage with their developers about the possibility of ticketed development work. As stated above, a fork of the Argus project would not be the desired outcome.
- Validate this against our own schedule.
- Break down all the tasks into granular measured pieces of work and deliverables that can be managed and developed in an Agile way.
- Decide if ITIL naming conventions and standards are compatible with the existing OC workflow
- Respect that the Argus team is building a multi-purpose tool, and identify where we might need to compromise or build internal plug-ins to fill any gaps they can’t fill
- Be involved in plugging the functional and technical gaps with some internal work on the parts of the system that require integration, for example by building plug-ins that render the status of services (traffic lights)
- Ensure close collaboration with the Argus development team and clear communication of expectations and requirements.
Action Plan
- Meet with the Argus development team by [DATE] to define ways of working and establish what can and can’t be done
- Create a list of questions
- Assess whether some development can be done in-house
- Build prototypes using the plug-in architecture
- Assess the upgrade plan for future releases of the tool
- Do they envisage a forked version? How will we upgrade the ‘core’?
- Consult with all internal engineers on the viability of outsourcing some of the development
- Create a Takeaway Document for the Argus team to respond to
- A list of bullets that we need definite answers on. Maybe even inclusive of timings?
Risks and Mitigations
- The biggest risk is not meeting deadlines expected of us by consumers and users of the service
- Understand what’s viable for delivery.
- Create a statement of work that underlines these deliverables
- Underestimating the task if we don’t utilise Argus and take all the development in-house.
- We don’t have to retire the previous service until it has passed A/B testing
- The third-party development team may later decide they don’t have capacity to work with us on bug fixes, feature requests, enhancements, etc.
- We would have to fork off the development. Do we have the Django/React skills to take over the development internally?
- If the solution is to fork the current state of the Argus project, what are the expectations (from all sides) regarding alignment on future improvements?
Conclusion
A new and modern alert aggregation platform is required by the Geant NOC/SOC first and second line support teams: one that can satisfy the needs of the consumers of the service and can still be maintained by the development team.
There is currently a backlog of feature requests coming from the NOC. We must recognise the importance of balancing user needs with development team maintainability.
Any conclusions must be reached by consensus within the internal development team. It should be agreed that cooperation and sharing is a proactive and viable course of action, one that allows us to focus on other development requirements such as testing, automating deployments, architecture and integration with other services.