Assignment 10 | I202 | Information Organization and Retrieval

Overview

I have chosen an enterprise data warehouse as the organizing system that I would like to focus on. A mid- to large-sized corporation has many types of immovable and movable infrastructure, such as office buildings, office furniture, transport buses and equipment. Of these, the most prized and dynamic type of resource is IT infrastructure. Servers, phone lines, network routers, laptops, PCs and proxy networks are ubiquitous in an information-centric company. Managing these resources as efficiently as possible is a primary goal of a company’s management, because the more efficiently they are handled, the less of a cost center IT equipment becomes; it can even be seen as a profit center instead. IT infrastructure is therefore a very important resource that needs to be organized logically so that we can support interactions (like adding or removing a resource) efficiently without affecting any dependent processes.

What resources are being used?

The resources that are being used in a company are generally the hardware and software components owned by the company. They can be a server, a network switch, a laptop or an application. These resources support the activities of a company for both its internal customers and its external customers. But for our organizing system, which is a data warehouse, the resources are information components that hold data about these physical objects. This information can be about when a server was added to a network, when it was upgraded, when an issue was reported about the hardware and how much time it took to resolve it. The format of the source data fed into the data warehouse is an important distinction for the ‘Extract’ process of a data warehouse. The data can be received in the form of flat files (plain text) or from relational databases. Each format requires its own technologies and handling procedures, like UNIX shell scripts for flat files and database procedures for relational data.

Although all the information resources received are primary resources, they link to each other and hence act as description resources for each other. For example, a ticket raised to replace a server may be linked to a ticket raised to report that server crashing. The two information resources are primary resources in themselves, each containing details of when the ticket was raised, who it was assigned to and when it was resolved, but by linking them together they form metadata for each other. This is the kind of shared network of information components that the data warehouse tries to capture. Depending on the focus of our interest, the same record can be either a primary resource or a description resource. The resources primarily provide information about physical resources, but they also capture information about the processes related to these physical resources.
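The linking of tickets described above can be sketched in a few lines of Python. This is only an illustration: the `Ticket` class, its field names, the `related_to` link and the sample ticket values are all hypothetical and not taken from any real ticketing system. The point is that each record stands on its own as a primary resource and becomes a description of the other once the link is followed in either direction.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Ticket:
    """A primary information resource (all field names are assumptions)."""
    ticket_id: str
    raised_on: str
    assigned_to: str
    resolved_on: Optional[str] = None
    related_to: List[str] = field(default_factory=list)  # links to other tickets


# Hypothetical sample data: a crash report and the replacement it triggered.
crash_report = Ticket("1001-INC", "2013-01-05", "team_a", "2013-01-06")
replacement = Ticket("2001-CHG", "2013-01-06", "team_b", "2013-01-09",
                     related_to=["1001-INC"])

tickets: Dict[str, Ticket] = {t.ticket_id: t for t in (crash_report, replacement)}


def describe(ticket_id: str) -> List[str]:
    """Return the IDs of tickets that act as description resources for ticket_id.

    Links are followed in both directions, so the focus ticket decides
    which record is 'primary' and which is 'description'.
    """
    focus = tickets[ticket_id]
    linked = set(focus.related_to)
    linked.update(t.ticket_id for t in tickets.values()
                  if ticket_id in t.related_to)
    return sorted(linked)
```

Calling `describe("1001-INC")` treats the crash report as the primary resource and returns the replacement ticket as its description; calling `describe("2001-CHG")` reverses the roles.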

The resources are grouped together depending on the process or physical object they represent, for example information about a server versus information about adding a new server. Although both might relate to the same server, they clearly have different domains and carry different types of information. As a result, they are treated as different resources.

The resources are named according to the domain they belong to. For example, information about incidents raised for a piece of equipment can be grouped into an Incident entity, and all change-related information can be grouped into a Change entity. The naming of these entities in the data warehouse conforms to a controlled vocabulary and a fixed syntax. For instance, ‘server change-related information’ can be referred to as CHG, and the corresponding table in the data warehouse would have a name with CHG as a suffix. Having a primary key such as ‘12345-INC’ for a server incident-related record ensures that there are no collisions with a record in a different domain. This qualified name also makes the identifier more informative.
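A minimal sketch of this naming convention is shown below. The codes INC and CHG and the key format ‘12345-INC’ come from the text above; the function names, the `DOMAIN_CODES` dictionary and the `SERVER_` table prefix are assumptions made only for illustration.

```python
# Controlled vocabulary: the only domain codes that may appear in names.
DOMAIN_CODES = {"incident": "INC", "change": "CHG"}


def qualified_key(record_number: int, domain: str) -> str:
    """Build a domain-qualified primary key such as '12345-INC'.

    A domain outside the controlled vocabulary raises KeyError,
    which enforces the fixed syntax.
    """
    return f"{record_number}-{DOMAIN_CODES[domain]}"


def table_name(subject: str, domain: str) -> str:
    """Build a table name that carries the domain code as a suffix."""
    return f"{subject.upper()}_{DOMAIN_CODES[domain]}"
```

The same record number used in two domains yields two distinct keys, for example `qualified_key(12345, "incident")` and `qualified_key(12345, "change")`, which is what prevents collisions across domains.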

Why are the resources organized?

The information resources are organized in order to facilitate easy reporting. When every resource is intrinsically linked to other resources, getting the bigger picture can be a problem for mid-level and senior-level managers who are responsible for gauging the effectiveness of a new strategy and making the required corrections. A person who manages a team responsible for resolving server incidents would like to know how many issues the team resolved, whether the team is effective, whether more people are needed on the team, and whether the way issues are assigned to people should change. A single go-to point that can help answer all these questions is an operational necessity. The faster this organizing system can answer the manager’s questions, the more relevant the data warehouse becomes. That ultimately depends on the types of interactions the system supports, like providing a canned report with trend graphs or exporting raw data so that it can be processed by another tool.

How much are the resources organized?

The user requirements decide how much the resources are organized. The effectivity of captured information is decided by the end users. Would they like to preserve data that was loaded to the data warehouse more than a year ago, or are they only interested in the last 2 months of data? If each user’s requirement is considered a set, then the union of all the sets of all possible users establishes the minimum amount of data that needs to be maintained. The level of granularity of the data is likewise dependent on user requirements. If users are only interested in a high-level summarized view of the data, we need not preserve raw data pertaining to each and every ticket in the system. In this case the interaction is based purely on a collection-level property. There could also be a new entity that will be reported as a standalone process, without any relationships with other processes. In that case, if there is no reporting requirement, it makes sense to just add that resource to the system without explicitly modelling the relationships that it actually has with other entities.

If a star database schema is sufficient to satisfy user requirements, there is no need for a snowflake data model, which is inherently more complex and more difficult to maintain. There could be cases where two separate domains always go together, like ‘Incident’ information and ‘Rootcause’ information. Every incident has a root cause, so should we introduce these as separate entities, or can we group them together into a single ‘Incident’ domain? This is a one-to-one relationship, which can easily be housed in a single table. I have noticed that this question is usually answered by the performance implications of having too many resources in the system, since joining them and rendering them in a report is time consuming. So wherever possible, collocation (denormalization) makes better sense than unnecessarily creating too many resource categories.
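The idea of treating each user’s requirement as a set, and taking the union to find the minimum data to retain, can be sketched as follows. The three users, their retention windows and the `requirement` helper are hypothetical; each requirement is modelled, purely as an assumption, as a set of (domain, months-back) pairs.

```python
def requirement(domains, months):
    """One user's requirement: every (domain, months_back) pair they need.

    months_back = 0 is the current month, 1 is last month, and so on.
    """
    return {(d, m) for d in domains for m in range(months)}


# Hypothetical users and windows, for illustration only.
team_manager = requirement(["INC"], 2)     # last 2 months of incidents
senior_manager = requirement(["INC"], 12)  # a full year of incidents
change_manager = requirement(["CHG"], 3)   # last 3 months of changes

# The union of all requirements is the minimum that must be maintained.
must_keep = team_manager | senior_manager | change_manager

# Collapse the union into a retention window (in months) per domain.
retention = {}
for domain, months_back in must_keep:
    retention[domain] = max(retention.get(domain, 0), months_back + 1)
```

Here the team manager’s requirement is a subset of the senior manager’s, so it adds nothing to the union; the resulting retention is 12 months for INC and 3 months for CHG.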

When are the resources organized?

The resources are organized according to the level of time granularity the user wants. In this specific case, the manager might want to look at the number of incidents resolved on a day-to-day basis. If the application or group being supported is critical, day-to-day monitoring makes sense, because any dip in the productivity of the team can be caught quickly and remedial action can be taken. The frequency of the data warehouse loads reflects this priority.
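A day-level view of resolved incidents can be sketched with the standard library alone. The sample rows and the `resolved_per_day` function are hypothetical and stand in for whatever the daily load would actually deliver.

```python
from collections import Counter
from datetime import date

# Hypothetical rows: (incident key, day the incident was resolved).
resolved = [
    ("1001-INC", date(2013, 1, 7)),
    ("1002-INC", date(2013, 1, 7)),
    ("1003-INC", date(2013, 1, 8)),
]


def resolved_per_day(rows):
    """Count resolved incidents at day granularity."""
    return Counter(day for _, day in rows)


daily = resolved_per_day(resolved)
for day in sorted(daily):
    print(day.isoformat(), daily[day])
```

A coarser granularity, such as weekly or monthly, would only change the grouping key, which is why the granularity the user asks for drives how often the warehouse needs to be loaded.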

Who does the organizing?

Automated processes designed to Extract, Transform and Load data do the organizing on a daily basis. The designer of these processes, the data architect, creates the principles of the organizing system (the data model) after interacting with the end users and inquiring about their requirements and specific needs. These requirements are almost always related to what kind of data the users would like to see together, in a single pie chart or in a single table. Understanding the relationships between the resources and maintaining the capabilities and efficiency of the system are the key concerns of the organizer.
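The Extract, Transform and Load steps can be sketched as three small functions. This is a simplified illustration in Python with an in-memory SQLite database, not the shell scripts or database procedures a real warehouse would use; the pipe-delimited layout, the column names, the sample rows and the `SERVER_INC` table are all assumptions.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file extract (pipe-delimited plain text).
FLAT_FILE = """ticket_no|raised_on|resolved_on|assigned_to
12345|2013-01-05|2013-01-06|team_a
12346|2013-01-06||team_a
"""


def extract(text):
    """Extract: parse the flat file into one dict per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="|"))


def transform(rows):
    """Transform: qualify the key with its domain, turn blanks into NULLs."""
    return [
        (f"{r['ticket_no']}-INC", r["raised_on"],
         r["resolved_on"] or None, r["assigned_to"])
        for r in rows
    ]


def load(conn, records):
    """Load: write the records into the (hypothetical) incident table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS SERVER_INC ("
        "incident_id TEXT PRIMARY KEY, raised_on TEXT, "
        "resolved_on TEXT, assigned_to TEXT)"
    )
    # INSERT OR REPLACE keeps a repeated daily load from failing on old keys.
    conn.executemany("INSERT OR REPLACE INTO SERVER_INC VALUES (?, ?, ?, ?)",
                     records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(FLAT_FILE)))
open_count = conn.execute(
    "SELECT COUNT(*) FROM SERVER_INC WHERE resolved_on IS NULL"
).fetchone()[0]
```

Keeping the three steps as separate functions mirrors the division of labour described above: the data architect decides the target table and the transformation rules, and the automated job simply runs them on each load.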

Other considerations

Readability of the code and documentation matters a lot when creating a data warehouse. Typically, the person or team that creates the organizing system moves on once the system has been built, and a support team maintains it from then on. If the design is too complex, or if the various processes are not documented properly, supporting the system and modifying existing functionality become difficult. The organizing principles need to be clearly enunciated so that new interactions can be added without affecting any existing functionality.