BUSINESS PURPOSE
The "PRIDE"-Data Base Engineering Methodology (DBEM) is an
important part of an overall philosophy of Information Resource
Management (IRM) as defined by "PRIDE". This involves the
development and control over all of the resources required to
produce information. Whereas the "PRIDE"-Enterprise Engineering
Methodology (EEM) is principally concerned with developing an
Enterprise Information Strategy, methodologies such as DBEM and
the "PRIDE"-Information Systems Engineering Methodology (ISEM)
are concerned with actually creating the data and system
resources needed to produce information.
The intent of any of the "PRIDE" methodologies, including
DBEM, is to define the business environment as to "Who" is to
perform "What," "When," "Where" and "Why" (the "5-W's"). As a
result, it is used to convert a heterogeneous operating
environment into a homogeneous environment. This improves
communications and promotes cooperation and teamwork throughout
an enterprise. Better organization and discipline also enhance
the ability to build quality products and make effective use of
resources. In addition to the 5-W's, the methodology provides
"How" to perform the work by providing a variety of techniques
and tools deployed throughout the methodology. A methodology,
therefore, resembles an assembly line where work is performed in
manageable stages.
DBEM is a generic and universally applicable approach for
building any type of data base, regardless of industry, type of
application, software language/technique, or data base tool. A
Data Base Management System (DBMS) is not a prerequisite for
DBEM. The methodology is based on tried and proven approaches
that are so fundamental to sound data base design that tailoring
to individual development requirements is not only unnecessary
but highly undesirable.
CONCEPTS & PHILOSOPHIES
INTRODUCTION
Data is one of the resources needed to produce
information. This implies that Data exhibits distinctively
different characteristics than Information. Information
represents the intelligence or insight needed to support
business actions and/or decisions. A data element by itself
is meaningless. It is used to define the facts and events
about a business. Data identifies and describes the objects
of importance to the enterprise, such as products, orders,
customers, vendors, parts, billings, payments, shipments, etc.
It is also used for quantitative purposes in measurements and
calculations. A single information requirement, thereby,
represents an assemblage of these business facts presented
in a specific context and time frame. In this respect, data
represents the raw material needed to produce information.
Obviously, one data element can support many information
requirements. Because of this, it is necessary to manage
data like any other resource.
Data is the binding force behind information systems.
The only way systems communicate, either internally or
externally to other systems, is through data. For this reason
alone, data must be controlled to promote sharing between
information systems.
The management of data as a resource begins with a
corporate attitude and disposition, not with an elaborate set of
tools. Data Resource Management requires the same perspective
as managing the parts department of a manufacturing company.
The objective is twofold: to classify and standardize resources
so they may be shared by multiple applications; and to control
the collection, storage and distribution of resources to
minimize overhead. Both are concerned with the efficient and
cost effective use of resources.
Under DBEM, a data base is defined as all the data required
to produce information, regardless of where the data is used or
how it is stored. From this perspective, all companies have
a data base. In fact, the day a company begins to conduct
business is the day its systems and data base are born. Both
evolve with the business over time as the company's information
needs change.
Data Resource Management is another area where corporate
management has abdicated its responsibilities to technicians
who have turned a simple concept into an esoteric technical
practice. As computer technology evolved, several physical file
management techniques were introduced to manage data on the
computer, most notably the Data Base Management System (DBMS).
The intent of the DBMS is to physically store and retrieve data
for use in computer programs. In fact, the term "DBMS" is a
misnomer since it only deals with data on a direct access device
(disk). It does not deal with data on other devices, such as
tape files, card files, manual files, etc. Therefore, it does
not manage the entire data base, only a portion of it. Nor does
it do any logical file management which is more important than
the physical file management.
Although the DBMS was originally designed to permit the
sharing of data among applications, this has seldom been
implemented. Due to a lack of management discipline, the DBMS
is one of the most abused and misapplied products in the
industry. It is typically used as nothing more than an elegant
file access method, not as a tool for integrating systems.
What this points out is that companies have been taking a
tool oriented or physical approach to managing data. Despite
the considerable investment in DBMS technology over the last 25
years, very few companies have realized a managed data base
environment. Why? Primarily due to management's failure to
recognize and treat data as a reusable resource. Data Resource
Management is a "materials management" issue, nothing more,
nothing less.
Imagine a manufacturing company without a materials
management function. Under this scenario, engineers would
design products without consideration for the other products
marketed by the company. Each product would be designed with a
unique set of parts. Inevitably, many duplicate parts would be
designed. Without some form of coordination, there would be
significant overhead and waste from collecting and storing
redundant parts. Also, because no formal control mechanism
existed to track parts, implementing changes to parts
consistently throughout a product line would be a haphazard
endeavor.
This is exactly the situation that occurs during
traditional systems development. Each analyst and programmer is
permitted to design data bases unique to their application. The
result: rampant data redundancy throughout the organization.
There is a natural tendency for an analyst to do only what is
best for their individual assignment, not necessarily what is
best for all corporate applications. This is why a neutral
third party is required to coordinate and standardize data
resources on a corporate basis. This is the mission of the
Data Resource Management function.
DATA BASE CONSTRUCTS
Data resources can be organized into a generic and
universal structure. The basic building block is the data
element itself, the representation of an individual fact or an
event. A collection of one or more data elements is a record
and one or more records make up a file. As previously
mentioned, a data base represents all of the data used to
produce information, regardless of where used or how stored.
BASIC CONSTRUCTS
All data resources are structured in this generic manner.
Terms such as "schema," "sub-schema," "segments," "tracks,"
"cylinders," "sectors," "tables," "arrays," "tuples," "data
stores," etc., (all of which deal with particular computer
techniques and tools), can all be translated into the basic
constructs mentioned above.
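The element/record/file/data base hierarchy described above can be sketched in a few lines of Python. This is only an illustration of the nesting rules ("one or more" at each level); the class names and the sample "Customer" data are assumptions made for the example and are not part of "PRIDE".

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataElement:
    """An individual fact or event, such as 'Customer Number'."""
    name: str


@dataclass
class Record:
    """A collection of one or more data elements."""
    name: str
    elements: list

    def __post_init__(self):
        if not self.elements:
            raise ValueError("a record needs at least one data element")


@dataclass
class File:
    """A collection of one or more records."""
    name: str
    records: list

    def __post_init__(self):
        if not self.records:
            raise ValueError("a file needs at least one record")


@dataclass
class DataBase:
    """All data used to produce information, wherever it is stored."""
    files: list = field(default_factory=list)

    def element_names(self):
        # Walk file -> record -> element and collect the distinct names.
        return {e.name for f in self.files for r in f.records for e in r.elements}


# Hypothetical sample data for illustration only.
customer_id = Record("Customer Identification",
                     [DataElement("Customer Number"), DataElement("Name")])
customer_file = File("Customer", [customer_id])
data_base = DataBase([customer_file])
```

Nothing in the sketch says how or where the file is stored, which is the point: the same constructs apply to a DBMS table, a flat file, or a manual file.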
The organization of data serves two purposes: to logically
describe the "objects" used to manage and operate the business,
and to express how data will be physically stored.
The differences between logical and physical are substantial;
there will not necessarily be a direct relationship between
the two.
LOGICAL/PHYSICAL CHARACTERISTICS
There is not necessarily a one-to-one relationship
between logical and physical files.
Physical files may differ considerably from logical files.
Here, the file represents a particular way of physically storing
data. Data may be physically stored in a variety of files, such
as an indexed file, a "flat" file, a DBMS file, etc. Even
manual files follow this model with the exception they also
store inputs and/or outputs (both of which consist of records
and data elements).
Unlike the logical file that is organized according to a
unique data identifier ("primary basic grouping"), the physical
file does not require any specific organization and can use any
sort/access key desired. Ultimately, it depends on the file
management technique or tool being used.
The logical view of data is the basis for all physical
data base design, regardless of the file management technique or
tool selected. The physical files must ultimately carry out the
intentions of the logical files in terms of what data must be
stored, the dependencies between data, and volume. As a matter
of fact, all DBMS packages can implement these logical views,
regardless of whether they have a hierarchical, network,
relational, or object-oriented structure.
LOGICAL TO PHYSICAL RELATIONSHIPS
Again, there are substantial differences between logical
and physical files. Perhaps the most noticeable difference is
the logical file will remain relatively static while the
physical file will change dynamically, based on advances in
technology. One of the most important reasons for defining data
resources logically is to seek data independence from the
physical environment, thus allowing any physical implementation
without disrupting systems.
OBJECT CONCEPT
In the simplest terms, "objects" represent "things" used
in the operation of an enterprise, such as Products, Parts,
Customers, Employees, Vendors, Orders, Shipments, Billings,
Payments, etc. Objects are initially identified when the
business is defined under the Enterprise Engineering Methodology
(EEM) and passed to DBEM for further definition.
Objects are uniquely identified by a single data element
which is referred to as the "primary basic grouping." For
example, "Customer Number" is used to uniquely identify a
"Customer" object; "Shipment Number" identifies a "Shipment";
"Order Number" identifies an "Order"; etc. This means that
AN OBJECT IS AN OBJECT WHEN A UNIQUE DATA ELEMENT HAS BEEN
ESTABLISHED TO IDENTIFY IT. Further, the "primary basic
grouping" becomes the common bond that relates data to one
another. For example, all customer related data will be
"grouped" around the "Customer Number," shipping related data
around "Shipment Number," etc.
Not all numbers and codes will necessarily represent
objects. For example, to a post office or shipping company, a
Postal Territory is extremely important and requires management.
As such, a "Zip/Postal Code" data element is created to
uniquely identify each territory. However, to other companies,
a Postal Territory is of little concern or importance; it is
not an object they must deal with in the operation of their
organization. In this situation, "Zip/Postal Code" is used for
nothing more than descriptive purposes about "Customers,"
"Vendors," "Employees," etc. This means AN OBJECT MUST BE
RELATED TO AT LEAST ONE BUSINESS FUNCTION THAT IS RESPONSIBLE
FOR ITS CONTROL. IF SUCH A BUSINESS FUNCTION DOES NOT EXIST,
IT IS NOT AN OBJECT.
As another test of the validity of an "object," consider
how the data element is assigned (its "source"). If the data
element is assigned from an external source, then in all
likelihood the object is not valid. To illustrate, "Zip/Postal
Code" is assigned by the Post Office (not by the average
company). "Postal Territory," thereby, is a pertinent object
for the Post Office, but not for the average company.
TYPES OF OBJECTS: FACTS AND EVENTS
There are basically two types of "objects": facts and
events. Factual objects represent tangible things such as
Products, Parts, Employees, and Vendors. In contrast, event
related objects are intangible things that are date/time
related, for example: Order, Shipment, Billing, Purchase, etc.
An event represents some form of interaction between two or more
factual objects. To illustrate, a "Customer" (fact), places an
"Order" (event), for a "Product" (fact).
One nuance that distinguishes facts from events is the type
of data associated with each. "Names" and "Locations" can be
associated with facts; "Dates" and "Times" with events. For
example, "Product Name" makes sense, "Product Date" does not.
"Order Date" makes sense, "Order Name" does not.
Since logical files are used to model "objects," they
contain only primary data elements to identify, describe and
quantify the object. Generated data is not included in logical
files since it can be generated based on the primary values.
(This is another example of the differences between logical and
physical files; in some situations, it is practical to store
generated data elements).
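As a minimal sketch of this distinction, the logical record below carries only primary data elements, while a generated value is computed on demand. The "OrderLine" record, its field names, and the use of whole cents are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class OrderLine:
    """Logical record: primary data elements only."""
    order_number: str
    product_number: str
    quantity_ordered: int
    unit_price_cents: int

    @property
    def extended_price_cents(self) -> int:
        # Generated data: derived from primary values, never stored logically.
        return self.quantity_ordered * self.unit_price_cents


# Hypothetical occurrence for illustration.
line = OrderLine("ORD-1", "P-100", 3, 250)
```

A physical file could still store the extended price for performance reasons; the logical definition does not change if it does.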
UNDERSTANDING "VIEWS"
An "object" is divided into "views" to represent different
perspectives about an object. Data elements could conceivably
be grouped into one large record, but this would not provide any
significant insight into objects and data. Grouping data into
distinctly separate views simplifies our perspectives on data.
The objective is to uniquely identify an occurrence of data so
that it is not confused with another (for example, "Quantity
Shipped" versus "Quantity Ordered.") There are three types of
"views" (logical records) associated with "objects":
1. IDENTIFICATION VIEW - normally consists of a control number
or code to identify the object followed by a name or date; for example:
EMPLOYEE OBJECT - Employee Number & Name
PURCHASE OBJECT - Purchase Order Number & Date
There may be additional data elements used in this view.
However, the point is, all objects have one identification
view, regardless of whether they are facts or events. This implies
that if the Identification View is deleted, then the other
views in the objects are also deleted, as well as pertinent
relationship views in other objects.
2. CHARACTERISTICS VIEW - consists of data that is not used to
identify a separate object, nor create a relationship to
other objects. Instead, it is used to describe the internal
characteristics of an object; for example, a "Customer"
object may have two types of characteristic views:
ADDRESS - Including "Customer Number," "Type Address"
(a code to identify an address type for
Shipping/Billing/Mailing purposes), "Address,"
"City," "State," "Zip/Postal Code," and "Country."
CONTACT - Including "Customer Number," "Contact Number"
(to identify each customer contact), "Name,"
"Telephone Extension," "Internal Mail Code/Drop."
This implies that an address and a contact are not strong
enough to represent separate objects by themselves. Rather,
they are describing various aspects of the customer object.
An Object can have one or more of these views (or none).
Factual objects will typically have Characteristics Views;
Event related objects normally do not.
3. RELATIONSHIP VIEW - consists of data that is used to
establish relationships between objects. For example,
in a Shipping Record, there may be data elements to relate
"Quantity Shipped" to a Shipment ("Shipment Number"), to a
Product ("Product Number"), and to a Customer ("Customer
Number"). This implies relationships between:
CUSTOMER ------------- SHIPMENT ------------- PRODUCT
Event related objects will have at least one Relationship
View; Factual objects normally have none. There cannot be
more than two facts associated with a single event/view.
The relationship view connects with identification views (not characteristic
views) through record-to-record relationships. To illustrate:
   CUSTOMER                 SHIPMENT                 PRODUCT
IDENTIFICATION <------->> RELATIONSHIP <<------>> IDENTIFICATION
    RECORD                   RECORD                   RECORD
                 (1:M)                   (M:M)
These relationships require clarification to describe the
nature of each relationship, as denoted by the arrow
notation. In the example above, a Customer may have many
Shipments, but a Shipment pertains to a specific Customer
(this is a one-to-many relationship (1:M)). Also, a
Shipment may consist of many Products, and a Product can
be used in many Shipments (many-to-many relationship (M:M)).
These relationships and the basic grouping of the view
establish the constraints of the data base and are essential
in physical data base design.
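The three view types and the deletion rule stated for the Identification View can be sketched as follows. This is an illustrative model only: the "View" class, the "delete_object" helper, and all sample values are assumptions, and a basic grouping is represented here as a tuple of (element name, value) pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    kind: str              # "identification", "characteristics" or "relationship"
    basic_grouping: tuple  # (element name, value) pairs, primary basic grouping first
    data: tuple = ()       # descriptive/quantitative (element name, value) pairs


def delete_object(views, element_name, value):
    """Remove an object occurrence: its own views plus any relationship
    views in other objects that refer to it."""
    key = (element_name, value)
    return [v for v in views if key not in v.basic_grouping]


# Hypothetical occurrences for the Customer/Shipment/Product example.
views = [
    View("identification", (("Customer Number", "C1"),), (("Name", "Acme"),)),
    View("characteristics",
         (("Customer Number", "C1"), ("Type Address", "SHIP")),
         (("City", "Springfield"),)),
    View("identification", (("Product Number", "P1"),), (("Name", "Widget"),)),
    View("identification", (("Shipment Number", "S1"),), (("Date", "2004-12-11"),)),
    View("relationship",
         (("Shipment Number", "S1"), ("Customer Number", "C1"), ("Product Number", "P1")),
         (("Quantity Shipped", 5),)),
]

remaining = delete_object(views, "Customer Number", "C1")
```

Deleting customer "C1" removes its identification and address views and also the shipment's relationship view that pointed at it, leaving only the product and shipment identification views.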
BASIC GROUPING CONCEPT
What distinguishes the different views of an object is
the "basic grouping" of the logical record. The term "basic
grouping" refers to the indicative data elements used to
uniquely identify a view. It also represents a dependency
between data elements in a particular context (it is how they
are "grouped" into separate views). For example, in a Customer
Object, it is used to segregate address data, from credit data,
from customer contact data, etc. Perhaps it is easier to think
of the "basic grouping" as the key to a logical record (not a
physical record). Because the intent is to uniquely identify
data, the "basic grouping" consists only of "Indicative" data
(not "Descriptive" or "Quantitative").
There may be up to two parts in a single basic grouping:
PRIMARY BASIC GROUPING - Since views are used to describe
objects, they must all be defined with the one data element
used to uniquely identify the overall object; this will be
a primary/indicative/object-oriented data element. The
Primary Basic Grouping will be the basis to sort views into
objects (e.g., all Product related Records will be put in
the Product File, all Customer related Records in the Customer File, etc.).
SECONDARY KEYS - used to distinguish either Characteristic
Views or Relationship Views.
For Relationship Views where it is necessary to relate one
object to another (such as an Order to a Customer and to a
Product) additional primary/indicative/object-oriented data
elements may be added to the basic grouping. These data
elements are sometimes referred to as the "Foreign Keys" to
another object.
The concept of "Referential Integrity" (as commonly referred
to in the industry) is concerned with the logical consistency
between views or records. It requires that every occurrence
of a foreign key has a corresponding occurrence as a primary
basic grouping in an identification view. As in the
"Customer/Shipment/Product" example previously mentioned,
"Customer Number" and "Product Number" can be used as
Secondary Keys only if they are also used as a Primary
Basic Grouping in other identification views.
As an aside, since a "Relationship View" typically applies to
event related objects, the secondary key should normally
consist of object-oriented identifiers related to factual
objects. This will be used to bridge factual objects through
the event object (see "Other Object Considerations").
Characteristic Views also require Secondary Keys. However,
the intent here is not to establish a relationship to another
object, but rather to establish a separate view within an
object (an internal relationship). Under this approach,
"view identifiers" are used to segregate data into separate
records. For example, "Type Address" is used to distinguish
address related data from other data.
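The referential integrity rule described above lends itself to a simple check: every foreign key in a relationship view must also appear as the primary basic grouping of some identification view. The sketch below assumes basic groupings are tuples of (element name, value) pairs; the function name and sample values are hypothetical.

```python
def dangling_foreign_keys(identification_keys, relationship_views):
    """Return foreign keys that have no matching identification view."""
    missing = []
    for basic_grouping in relationship_views:
        # Position 0 is the view's own primary basic grouping;
        # the remaining entries are foreign keys to other objects.
        for key in basic_grouping[1:]:
            if key not in identification_keys:
                missing.append(key)
    return missing


# Primary basic groupings that exist in identification views (sample data).
identification_keys = {
    ("Customer Number", "C1"),
    ("Product Number", "P1"),
}

relationship_views = [
    (("Shipment Number", "S1"), ("Customer Number", "C1"), ("Product Number", "P1")),
    (("Shipment Number", "S2"), ("Customer Number", "C9"), ("Product Number", "P1")),
]

violations = dangling_foreign_keys(identification_keys, relationship_views)
```

Here shipment "S2" refers to a customer "C9" that has no identification view, so it is reported as a violation. View identifiers such as "Type Address" are not checked, since they do not point to another object.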
The sequencing of the basic grouping is extremely important
for three reasons:
- It is the principal criterion for combining logical records
into logical files (based on the primary basic grouping data element).
- It is the principal criterion for establishing relationships between
logical records (based on secondary keys).
- As we will see when we discuss "Descriptive" and "Quantitative" data,
the "basic grouping" gives meaning to the data.
Because of its importance, the "basic grouping" must be
established in a prescribed format. Unlike a key in a physical
record, which can be set to whatever is convenient, the basic
grouping must be assigned as:
- The Primary Basic Grouping first.
- Secondary Keys last.
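Because the primary basic grouping always comes first, logical records can be sorted into logical files by looking only at the first element of each basic grouping. A minimal sketch, assuming each logical record is represented by its basic grouping as a tuple of data element names (the function name is hypothetical):

```python
from collections import defaultdict


def group_into_logical_files(basic_groupings):
    """Group logical records into logical files by primary basic grouping."""
    files = defaultdict(list)
    for basic_grouping in basic_groupings:
        primary = basic_grouping[0]  # prescribed format: primary first
        files[primary].append(basic_grouping)
    return dict(files)


records = [
    ("Customer Number",),                                      # identification view
    ("Customer Number", "Type Address"),                       # characteristics view
    ("Customer Number", "Contact Number"),                     # characteristics view
    ("Shipment Number",),                                      # identification view
    ("Shipment Number", "Customer Number", "Product Number"),  # relationship view
    ("Product Number",),                                       # identification view
]

logical_files = group_into_logical_files(records)
```

If the ordering rule were not followed, the shipment relationship record could be filed under the Customer or Product object by mistake, which is why the sequence is prescribed rather than left to convenience.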
OTHER OBJECT CONSIDERATIONS
Factual Objects will typically relate to other Factual
Objects through Events. This is quite common and natural. As
facts or events are being defined, the analyst should challenge
what peripheral facts and events are involved.
FACTS & EVENTS
The "PRIDE" concept of "Objects" is an advanced refinement
over the "Entity/Relationship" model as commonly referred to in
the industry. The differences are subtle but significant:
- "Entities" typically represent only "facts" and have trouble depicting
"event" relationships. Under DBEM, an "event" is just another "object," thus
simplifying the establishment of relationships.
- "Entities" are typically scattered with no cohesive bond to unite common
entities. For example, there is no one point where a global view of a customer
can be found. In contrast, the "Object" concept uses the primary basic grouping
of the various views to group compatible logical records into a single logical
file, thus providing a total picture of the object.
Coincidentally, the "PRIDE" concept of "Objects" is
compatible with what the industry refers to as "object-oriented"
data bases and programming. These are techniques that were
developed independent of DBEM, yet are complementary.
APPLICATION VERSUS ENTERPRISE
It would be easy to say there are just two types of
data base models, logical and physical. However, there is
another perspective that adds a different dimension to this,
and that is how data is viewed from an "enterprise" versus an
"application" perspective.
An "application" view refers to the data used in a
specific system. In terms of the logical model, it represents
the "local" data used to describe objects for a particular
Information System. It represents only those data elements
required to satisfy the information needs for a particular
application. Obviously, this will not necessarily be the
"global" view of the object, which is the intent of the
"enterprise" view. In other words, the "application" view will
usually be a subset of the "enterprise" view of data.
"APPLICATION VERSUS ENTERPRISE"
APPLICATION VIEW OF A CUSTOMER | ENTERPRISE VIEW OF A CUSTOMER
-------------------------------+------------------------------
Customer Number                | Customer Number
Name                           | Name
Credit Rating                  | Credit Rating
                               | Contact Number
                               | Title
                               | Telephone
                               | Address Code
                               | Address
                               | City
                               | State/Province
                               | Zip/Postal Code
The "enterprise" view represents a complete picture of the
object, with all of the data required to satisfy all
applications, not just one. Under this arrangement, there may
be multiple "application" views of objects, but only one
"enterprise" view of an object. In fact, it is quite common to
have many different "application" views of an object. One
system may require certain data elements about a customer object
while another requires a totally different set of data elements
to describe a customer. These legitimately separate views of
the customer, as defined by Systems Engineering during design,
are coordinated through the enterprise view of the customer as
controlled by the Data Engineering function.
When a system is designed into sub-systems with logical
files, the "enterprise" data base is adjusted to accommodate the
"application" data base. If the objects encountered in the
system are new to the enterprise, then new enterprise views must
be defined. Initially, the application and enterprise views
of an object are identical. As new applications are introduced
with different views of the same object, then the enterprise
view is modified accordingly by Data Engineering.
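The evolutionary adjustment described above can be sketched as a merge: the enterprise view starts out identical to the first application view and grows as later applications bring new data elements. The function name and the two application names ("credit" and "shipping") are assumptions made for this illustration.

```python
def merge_into_enterprise(enterprise_view, application_view):
    """Return the enterprise view extended with any new application elements."""
    merged = list(enterprise_view)
    for element in application_view:
        if element not in merged:
            merged.append(element)
    return merged


# Hypothetical application views of the Customer object.
credit_view = ["Customer Number", "Name", "Credit Rating"]
shipping_view = ["Customer Number", "Name", "Address Code", "Address",
                 "City", "State/Province", "Zip/Postal Code"]

# First application: the enterprise view is identical to the application view.
enterprise_view = merge_into_enterprise([], credit_view)
initially_identical = enterprise_view == credit_view

# Second application: the enterprise view is extended, never duplicated.
enterprise_view = merge_into_enterprise(enterprise_view, shipping_view)
```

After both merges, each application view is a subset of the single enterprise view, and shared elements such as "Customer Number" appear only once.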
This application/enterprise relationship highlights the
fact that data base design is an evolutionary process. Other
data base design techniques typically take a "revolutionary"
approach by trying to identify all of the data requirements
for the entire company at one time. Obviously, the problem
with this approach is that it becomes an enormous and
unmanageable data base design project with questionable results.
Whereas the evolutionary approach naturally synchronizes the
data base with all of the applications, the revolutionary
approach develops a data base that will not necessarily match
the applications.
Under the evolutionary approach, the corporate data base
will expand and contract naturally as the business and
applications change. Consequently, excessive or unnecessary
data definitions will be avoided.
THE FOUR DATA BASE MODELS
The variables of logical versus physical and application
versus enterprise result in four data base models:
- The Application Logical Data Base Model (ALDBM) represents
all of the primary data elements needed to satisfy the information requirements
of a single application. In other words, all of the data needed to describe the
objects pertinent to a given information system. The ALDBM defines the logical
files used in a single system. It also represents a subset of the Enterprise
Logical Data Base Model.
- The Enterprise Logical Data Base Model (ELDBM) represents
the primary data elements used to describe all objects in an enterprise, not
just that data used in a single system. It represents all logical files in
the corporate data base.
- The Enterprise Physical Data Base Model (EPDBM) represents
how the data in the ELDBM is physically stored in files. The corporate data
base can be either centralized or distributed. A variety of file management
techniques can be used to store the data, e.g., computer files, manual files, etc.
The EPDBM, therefore, defines all of the physical files in the corporate
data base.
- The Application Physical Data Base Model (APDBM) represents
subsets of the EPDBM used to fulfill a specific application. It satisfies the
data requirements of the ALDBM and denotes the physical files used in the system.
As can be readily seen, there are corresponding relationships between the four data
base models; the enterprise view must satisfy the application view and the physical must
implement the logical. This type of data resource definition provides for the integrity
of the data base by assuring that only those resources required to serve changing business
information needs are maintained. In summary, it assures the corporate data base will be
correctly synchronized with all of the various applications.
The relationships between the four data base models can be
rather extensive. Diagrams using boxes and arrows are fine for
expressing simple relationships, but the relationships here are
seldom simple.
Instead, a set of matrices with horizontal rows and vertical
columns is a much more convenient and simpler approach for
expressing these relationships. These data base relationship
matrices are similar in intent and format to those mentioned in
Enterprise Engineering. Data base matrices are used to express
relationships between:
- Application Logical Records
- Enterprise Logical Records
- Enterprise Physical Records
- Application Physical Records
- Application Logical Records to Enterprise Logical Records
- Enterprise Logical Records to Enterprise Physical Records
- Enterprise Physical Records to Application Physical Records
- Application Logical Records to Application Physical Records
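One way to picture such a matrix is as a grid of rows and columns with a mark wherever two records are related. The sketch below is an assumption about the general shape only, not the "PRIDE" matrix format; the record names and helper functions are hypothetical.

```python
def build_matrix(rows, columns, related):
    """Return {row: {column: bool}} marking which pairs are related."""
    return {r: {c: (r, c) in related for c in columns} for r in rows}


def render(matrix):
    """Format the matrix as text, one line per row, X marking a relationship."""
    lines = []
    for row, cells in matrix.items():
        marks = " ".join("X" if hit else "." for hit in cells.values())
        lines.append(f"{row:<28} {marks}")
    return "\n".join(lines)


# Hypothetical Application Logical Records (rows) and
# Enterprise Logical Records (columns).
application_records = ["Order Entry: Customer", "Order Entry: Order"]
enterprise_records = ["Customer Identification", "Customer Address",
                      "Order Identification"]
related = {
    ("Order Entry: Customer", "Customer Identification"),
    ("Order Entry: Customer", "Customer Address"),
    ("Order Entry: Order", "Order Identification"),
}

matrix = build_matrix(application_records, enterprise_records, related)
report = render(matrix)
```

The same two helpers would serve for any of the pairings listed above, since each is simply a set of related (row, column) pairs.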
One by-product of data base modeling is that it provides for a complete
description of how data is used throughout a company, regardless of where used
or how stored. It tracks where data is collected, stored and retrieved. This
type of inventory control implements one of the basic missions of resource
management.
DATA AS A RESOURCE
Like a component in a product, data is a reusable resource
that can be shared between applications. To avoid redundancy
and to verify the integrity of the component, data must be
defined with a high degree of precision. Otherwise, it will be
virtually impossible to check for duplication. In order to
maintain its "cleanliness," data must be specified and
classified in the same manner as any other part in a product.
How data is defined will dictate how it is used. If
management cannot see what constitutes an element of data or how
it is derived, its validity, currency and accuracy will be
highly suspect. Ultimately, users will not be able to trust it.
They cannot tell whether to base decisions and actions on
information produced from data they do not understand or whose
source they do not know. This type of specification is
traditionally buried in obscure and inflexible program
source code. If it is not visible, it is not reusable. This is
why data elements like "Net Pay" are difficult to maintain,
because their logical calculations reside within source code and
are not maintained separately. The definition of data should be
used as the specification for programming. A program,
therefore, is nothing more than a mechanism to carry out
intentions of how the data should be manipulated and processed.
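To illustrate what it means for the definition to drive the program, the sketch below keeps the derivation of "Net Pay" in a visible data definition and leaves the program as a generic evaluator. The formula shown (gross pay less total deductions), the class, and the sample figures are assumptions for illustration only; a real definition would come from the business.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ElementDefinition:
    name: str
    definition: str
    inputs: tuple = ()                          # names of the source data elements
    derive: Optional[Callable[..., int]] = None  # None for primary data elements


# Assumed, simplified definition; maintained apart from any program.
NET_PAY = ElementDefinition(
    name="Net Pay",
    definition="Gross pay less total deductions (illustrative wording).",
    inputs=("Gross Pay", "Total Deductions"),
    derive=lambda gross, deductions: gross - deductions,
)


def evaluate(element, values):
    """Carry out the element's definition against a set of source values."""
    return element.derive(*(values[name] for name in element.inputs))


net_pay = evaluate(NET_PAY, {"Gross Pay": 5000, "Total Deductions": 1200})
```

Changing how "Net Pay" is calculated then means changing one definition, not searching program source code for every place the calculation was repeated.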
LOGICAL VERSUS PHYSICAL DATA
There is more to defining a data element than providing a
cryptic program label, yet this is all that is commonly
considered by the average programmer. This is far too vague and
inconsistent to assure any precision in data definition. There
are actually two aspects to be considered when defining data:
its logical meaning and its physical implementation.
A data element can have only one logical definition but can
have one or more physical implementations. If a data element is
an expression of a single fact or an event, it is important that
it be explicitly defined so it will not be confused with
another. If there is a genuine difference in interpretation of
the meaning of data between users, then more than one data
element is involved.
Although standardization of data's physical characteristics
is an objective, there can be multiple physical representations
of data. For example, there can be several legitimate ways to
represent "Calculated Delivery Date":
December 11, 2004
11 DEC, 2004
DEC-11-04
In this example, the data element has a singular logical
definition, "The calculated date when a delivery is due." All
that differs is how the data element is physically represented.
What this points out is the physical characteristics of data
may vary from one application to another.
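The three representations above can be produced from a single stored value, which is one way to see that only the physical form differs. This sketch uses Python's standard "datetime" module and assumes the default (English) locale for month names; the format labels are hypothetical.

```python
from datetime import date

# One logical value: "The calculated date when a delivery is due."
calculated_delivery_date = date(2004, 12, 11)

# Several physical representations of the same data element.
PHYSICAL_FORMATS = {
    "long": lambda d: d.strftime("%B %d, %Y"),
    "day_first": lambda d: d.strftime("%d %b, %Y").upper(),
    "short": lambda d: d.strftime("%b-%d-%y").upper(),
}

representations = {
    label: fmt(calculated_delivery_date)
    for label, fmt in PHYSICAL_FORMATS.items()
}
```

Each application can pick the representation it needs without creating a second data element.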
Obviously there are significant differences between how a
data element is logically and physically defined. Systems
Engineers and Data Engineers primarily deal with the logical.
Software Engineers and Data Base Administrators work with the
physical definition. Data Resource Management must govern both.
THE ANATOMY OF A DATA ELEMENT
A well defined data description contains vital
intelligence for establishing relationships between data
elements, and for constructing logical records and files.
Superficial or inaccurate data definitions will produce
erroneous results. For example, logical data base design is
totally dependent on the precise and accurate definition of
data. The objective when defining data, therefore, is to
prove that the data element is unique and non-redundant,
thus promoting sharing and re-using resources. This, in turn,
leads to system integration.
As mentioned, there are two aspects to data definition:
Logical and Physical. This narrative will describe both with
emphasis on Logical Characteristics.
LOGICAL CHARACTERISTICS
A. NAME AND DEFINITION
Each data element has a proper name by which it is commonly
referred to in business (not necessarily how it is used
in programming), such as "Employee Number," "Address,"
or "Net Pay." To eliminate confusion, the name should
not be redundant with another item.
The data element also has a textual description expressed
in a Webster or Oxford style dictionary format. The
description provides the meaning of the data element and
should be expressed in the terminology of the business.
B. TYPES OF DATA
There are three types of data elements: Indicative, Descriptive, and
Quantitative.
INDICATIVE data is used to uniquely identify an object
in part or in full. "Uniqueness" is an inherent property of indicative data
so that it can be used to clearly differentiate occurrences of an object. This is why
control numbers and codes are typically used as indicative data, as opposed to names.
Names can be too vague. For example, there may be more than one employee
named "John Smith." Without some form of qualifier, it is virtually impossible to
distinguish one "John Smith" from another. Consequently, an "Employee Number" is
assigned to uniquely identify each employee.
In most major corporations, "names" are treated as
descriptive data due to the volume of occurrences
(many employees, many products, many customers, many
parts, etc.). However, in smaller enterprises, where
there is not a high volume of occurrences, "names" are
a much more effective means for controlling occurrences.
For example, at a small produce market, the "Produce"
object is uniquely identified by name, not by number.
The point is, numbers and codes offer better control in
a major business, yet in a smaller organization, names
may be more practical and easier to use.
Indicative data is used to identify either a whole
Object, or a View within an Object. The difference
between the two is significant. Data elements such
as "Part Number," "Product Number," "Shipment Number,"
and "Billing Number" are strong enough to identify a
whole object by themselves (facts or events). Other
data elements may not be strong enough to represent a
separate object and are subordinate to the object-oriented
identifiers. For example, "Contact Number"
may be used to represent an individual person within a
"Customer." In this situation, a "Contact" object is
not strong enough to be independent from a "Customer"
(the company does not manage "Contacts," it manages
"Customer Contacts" instead). Under this scenario,
"Contact Number" is subordinate to "Customer Number" and,
as such, is used to represent a view of a customer.
"View Identifiers" are typically encountered on
"Characteristics Views" and are used in situations
where there are multiple occurrences (repeating groups)
of the same data element. As in the "Customer Contact"
example, a customer may have many people within it to
contact. To distinguish each person, a "Contact Number"
is devised to uniquely identify each person. This is
similar to a body of text where a "Line Number" is used
to differentiate each line of text.
A "View Identifier" is not used on an "Identification
View" of an object since the view implies a single occurrence of a data
element, not multiple. For example, an Employee will only have one name,
an Order will only have one date, etc.
A "View Identifier" is typically not used on a "Relationship View" since
object-oriented identifiers should be strong enough to uniquely identify
each occurrence of a data element.
DESCRIPTIVE data consists of alphanumeric characters that
are not strong enough to identify an object, but convey important business
facts about an object, such as names, addresses, text, codes, etc.
QUANTITATIVE data deals with numeric values that are either
calculated or are calculable. Measurements and computations are typical
examples: "Net-Pay," "Quantity Ordered," "Elapsed Time," "Percent
of Gross," etc.
It is sometimes difficult to differentiate Quantitative data from Indicative
data. Indicative data will often use numeric values for identification purposes,
such as "Invoice Number," "Purchase Order Number," "Customer Number," etc. However,
it would be a mistake to use these numbers for quantitative purposes (aside from
counting the number of occurrences).
Descriptive data should be expressed in the simplest of
terms, allowing its dependency on Indicative data (the
basic grouping) to express its meaning. For example:
INDICATIVE DATA   +   DESCRIPTIVE DATA   =   MEANING
Customer Number       Name                   Customer Name
Product Number        Name                   Product Name
Vendor Number         Address                Vendor Address
Employee Number       Address                Employee Address
This approach results in a minimal number of primary data
definitions. The alternative would be to define many names,
many addresses, and many dates. Data elements such as
"Employee Name" should be defined only in the absence of a
control number/code (such as "Employee Number"); in this
situation, "Employee Name" is indicative, not descriptive.
Names should automatically trigger the analyst to consider
the facts being represented, and the indicative data
elements related to them.
This discussion reflects the fact that the nature of the
business will dictate "data type." Not all numbers and
codes will necessarily be indicative; it is based on the
"object" or "view" being identified. This also highlights
the fact that "type" refers to the logical nature of data
exclusively; not how it will be physically used. After all,
descriptive and quantitative data can be used as a physical
sort/access key just as well as any indicative data element can.
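The way a simple descriptive element takes its meaning from the indicative element it depends on can be sketched in Python. This is only an illustration: the names DataType, DataElement and qualified_name are hypothetical and are not part of the methodology, and the rule used to derive the object name is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class DataType(Enum):
    INDICATIVE = "indicative"
    DESCRIPTIVE = "descriptive"
    QUANTITATIVE = "quantitative"


@dataclass(frozen=True)
class DataElement:
    name: str
    data_type: DataType


def qualified_name(indicative: DataElement, dependent: DataElement) -> str:
    """Derive a meaning such as 'Customer Name' from 'Customer Number' + 'Name'."""
    if indicative.data_type is not DataType.INDICATIVE:
        raise ValueError(f"{indicative.name!r} is not indicative data")
    # Assumption: the object's name is the identifier minus its last word
    # ("Customer Number" -> "Customer"). Real naming rules may differ.
    object_name = indicative.name.rsplit(" ", 1)[0]
    return f"{object_name} {dependent.name}"


customer_number = DataElement("Customer Number", DataType.INDICATIVE)
vendor_number = DataElement("Vendor Number", DataType.INDICATIVE)
name = DataElement("Name", DataType.DESCRIPTIVE)
address = DataElement("Address", DataType.DESCRIPTIVE)

print(qualified_name(customer_number, name))    # Customer Name
print(qualified_name(vendor_number, address))   # Vendor Address
```

Only two descriptive definitions ("Name" and "Address") are needed here, however many objects they describe.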
C. FORMS OF DATA
Data comes in two forms: Primary and Generated.
PRIMARY data refers to data in its virgin state; as
introduced to the system from an external source (such as
a person or department). "Source" defines who is
responsible for entering the data into a system, and who
has ultimate authority for the definition of the data
element. Depending on how well the data element is
defined, it may have either one or many "sources." For example, "Customer Name" may have
one source, the Customer Services area. Conversely, a generalized data
element, such as "Name," may have many sources depending on circumstances.
GENERATED data refers to data that relies on other data
elements in order to produce the necessary result. This type of data can involve
elaborate calculations and algorithms (e.g., DD-1 + DD-2 = DD-3). "Net Pay,"
"Balance Amount" and "Percent Complete" are some examples of calculated data.
Another form of generated data is a "Group" data item
that represents a specific string of data elements in
an assigned format. For example, "Credit Card Number"
typically consists of "Financial Institution ID," "Bank
Region Number," "Bank Branch Office ID," and "Account
Number." There are many other examples of "group" data, such as telephone
numbers, product identification codes, public utility account numbers, etc.
It is a common misconception that group data elements should
be used for basic groupings in logical records; THEY SHOULD
NOT! Group data is used as a convenient means to describe
dependencies between primary data elements. As such, a group data element
provides tremendous insight into objects and views. For example, consider
the objects and views associated with "Telephone Number."
Observe the dependencies between the three views. Each
has an impact on the others. Should the first view be deleted,
the second and third views will also be deleted. From this
perspective, the basic grouping defines dependencies
and eliminates the problem of multiple occurrences.
Data elements such as "Telephone Number" and "Credit Card
Number" should only be defined as group items if they
truly represent a concatenation of indicative data elements
representing objects. For example, "Telephone Number" is a
valid group item to identify a "Communications Area" and its
views for a telephone company. But if "Communication Area"
is not a pertinent object to your business, there is little
point in defining it as a group item. Instead, it is a simple primary value.
Group data may not be pertinent for logical records and files, but in
certain situations it can be a convenient sort/access key for physical files.
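Both forms of generated data, the calculated item and the group item, can be sketched in Python. The sketch is illustrative only: the function names are hypothetical, the "Net Pay" formula is an assumption (the text names the element but not its calculation), and the component values and widths of the "Credit Card Number" are invented.

```python
from decimal import Decimal


def net_pay(gross_pay: Decimal, deductions: Decimal) -> Decimal:
    # Hypothetical formula for illustration only.
    return gross_pay - deductions


def group_item(*components: str) -> str:
    """Build a group data item by concatenating primary data elements in order."""
    return "".join(components)


# Invented component values; only the component names come from the text.
financial_institution_id = "4"
bank_region_number = "123"
bank_branch_office_id = "45"
account_number = "6789012345"

credit_card_number = group_item(
    financial_institution_id,
    bank_region_number,
    bank_branch_office_id,
    account_number,
)

print(net_pay(Decimal("2500.00"), Decimal("612.35")))   # 1887.65
print(credit_card_number)                               # 4123456789012345
```

In both cases the generated value is rebuilt on demand from primary data elements rather than being defined as a primary element itself.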
D. DATA DEPENDENCIES
Aside from data-to-data relationships required to produce
generated data, another form of data dependency is required
when defining primary data. The purpose for defining this
dependency is to supply additional meaning to subordinate
descriptive and quantitative data definitions. For example,
it is reasonable to assume that there is a relationship
between a "Customer Number" data element and a "Name" data element -
"Name" DEPENDS on "Customer Number" to give it meaning ("Customer Name").
HOW A PRIMARY DATA ELEMENT IS DEFINED IS BASED ON THE DATA
ELEMENTS IT DEPENDS ON.
Logical data dependencies must be explicitly defined. This
intelligence will be required for creating logical records ("Views").
A data element may require a dependency on more than one
data element. For example, "Quantity Ordered" may
depend on "Order Number," "Product Number," and "Customer
Number" to give it meaning and uniquely identify a single
occurrence of "Quantity Ordered."
Descriptive and quantitative data have dependencies with
indicative data elements:
- They will require one or more "object" type data elements.
- They may require one "view identifier" type data element to
uniquely identify an occurrence of data.
Dependencies between indicative data elements should also be established to reflect
how "view identifiers" depend on superior "object identifiers" (for example, "Contact
Number" to "Customer Number").
Dependencies between primary indicative data elements that are prohibited include:
1. An "object" to "object" data element relationship is prohibited since this
will be defined through logical records.
2. A "view identifier" to "view identifier" data element
relationship is prohibited since this would create an abnormal logical data base design.
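The dependency rules above can be sketched as a simple check in Python. This is one possible reading of the rules, not a definitive implementation; the names Kind and check_dependencies are hypothetical.

```python
from enum import Enum


class Kind(Enum):
    OBJECT_ID = "object identifier"
    VIEW_ID = "view identifier"
    DEPENDENT = "descriptive or quantitative"


def check_dependencies(kind, depends_on):
    """Return rule violations for one data element; an empty list means none."""
    objects = depends_on.count(Kind.OBJECT_ID)
    views = depends_on.count(Kind.VIEW_ID)
    problems = []
    if kind is Kind.OBJECT_ID and objects:
        problems.append("object-to-object dependency is prohibited")
    if kind is Kind.VIEW_ID:
        if views:
            problems.append("view-identifier-to-view-identifier dependency is prohibited")
        if not objects:
            problems.append("a view identifier needs a superior object identifier")
    if kind is Kind.DEPENDENT:
        if not objects:
            problems.append("at least one object identifier is required")
        if views > 1:
            problems.append("at most one view identifier is allowed")
    return problems


# "Quantity Ordered" depends on Order, Product, and Customer Numbers.
print(check_dependencies(Kind.DEPENDENT, [Kind.OBJECT_ID] * 3))   # []
# "Contact Number" depends on "Customer Number".
print(check_dependencies(Kind.VIEW_ID, [Kind.OBJECT_ID]))         # []
# An object identifier that depends on another object identifier.
print(check_dependencies(Kind.OBJECT_ID, [Kind.OBJECT_ID]))
```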
E. PROBLEMS IN DATA DEFINITION
Defining data elements is not always easy. Quite often, how a data element is assigned
differs from its business purpose. For example, a financial institution may
require "Mother's Maiden Name" from an account holder. In reality, the financial
institution is not so much interested in the mother as it is in establishing a unique
"Security Password" to validate the account holder in case of an emergency. "Security
Password" is therefore the data element; "Mother's Maiden Name" is how it is assigned.
Another common example is "Social Security Number" as used in the United States. This
is a number used by the federal government to identify each citizen for retirement benefits.
Many companies will use the number to uniquely identify each employee as opposed to
inventing a separate numbering convention. In this situation, the data element is
actually "Employee Number," but the number is assigned by the U.S. Social Security
Administration.
Another common problem is to create an indicative item that is bound to a physical
input/output as opposed to an object. The classic examples here are "Check Number" and
"Deposit Slip Number" as used in banking. In reality, a check or deposit slip is a
physical input for recording a "Debit" or "Credit." Obviously, there are other ways
of creating "Debits" and "Credits," particularly with electronic banking (automatic
funds transfer for example). Under this scenario, checks and deposit slips are not used;
therefore, "Check Number" and "Deposit Slip Number" are invalid. The actual data
elements are "Debit Number" and "Credit Number."
PHYSICAL CHARACTERISTICS
The physical definition of data is perhaps easier for the
average programmer and data base administrator to
comprehend. It includes such things as:
- Length - defines the maximum number of characters that can
be assigned to a data element.
- Class - defines the type of characters used to express a
data element, e.g., alphabetic, numeric, alphanumeric, signed numeric, etc.
- Justification - defines the alignment of data within a
field when the number of characters is less than the length
of the receiving field, e.g., left, right, around the decimal point.
- Fill Character - defines the character to complete a field
when the data item is shorter than the maximum length, e.g., blank, zero, X, etc.
- Void Character - defines the character to be used when a data
item's value is unknown or non-existent, e.g., blank, zero, X, etc.
- Unit of Measure - defines the representation of numeric data,
e.g., area, volume, weight, length, time, energy rate, money, etc.
- Precision - defines for numeric data the number of
significant digits in a number.
- Scale - defines for numeric data the placement of the
decimal point.
- Base - defines for numeric data the radix used for
representing the number in programming, e.g., decimal, binary, octal,
hexadecimal, etc.
- Mode - defines the format (and type) of a data element for
programming, e.g., fixed point integer, floating point, double precision
floating point, complex, binary, packed decimal, polar coordinates, etc.
- Picture - defines how the data element is expressed for
programming. It is typically based on length, class, precision and scale.
- Program Label - defines the proper name of the data element
as it will be referred to in a programming language, such as COBOL, FORTRAN,
PL/1, Assembler, C, Pascal, ADA, etc. One data element may have many
program labels.
- Validation Rules - defines specific values which the data
element may assume. For example, Yes/No, specific codes or
numbers to be used, editing/syntactical rules, etc.
Although Systems Engineering is primarily concerned with
logical specifications, it will also provide assistance in
gathering physical specifications, particularly when formatting
inputs and outputs.
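How Length, Justification, Fill Character and Void Character work together can be sketched in Python. This is a minimal illustration under stated assumptions: the class name PhysicalDefinition and its fields are hypothetical, and justification "around the decimal point" is not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalDefinition:
    length: int
    justification: str = "left"   # "left" or "right"
    fill_character: str = " "
    void_character: str = " "

    def render(self, value):
        """Format a value into a fixed-length field."""
        if value is None:
            # Unknown or non-existent value: the whole field is void characters.
            return self.void_character * self.length
        text = str(value)
        if len(text) > self.length:
            raise ValueError(f"{text!r} exceeds maximum length {self.length}")
        if self.justification == "right":
            return text.rjust(self.length, self.fill_character)
        return text.ljust(self.length, self.fill_character)


city = PhysicalDefinition(length=10)
amount = PhysicalDefinition(length=8, justification="right",
                            fill_character="0", void_character="0")

print(repr(city.render("Tampa")))   # 'Tampa     '
print(amount.render(1250))          # 00001250
print(amount.render(None))          # 00000000
```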
DATA TAXONOMY & DOMAINS
The management of any resource requires the development of
a classification system. Financial resources are typically
arranged according to a chart of accounts; material and human
resources are categorized by type. In science, everything from
chemical elements to the animal kingdom is organized according
to a class structure. There obviously is a purpose to uniquely
identify common elements; to provide for the ability to
distinguish one from another, and eliminate redundancy.
In all instances, classification is based on the inherent
characteristics of the element.
A Data Taxonomy is a hierarchical structure that
separates data into specific classes of data based on common
characteristics. The taxonomy represents a convenient way to
classify data to prove that it is unique and without redundancy.
This includes both primary and generated data elements.
CLASSIFYING DATA
The objective is to eliminate redundancies
and promote sharing/integration
DOMAIN - Elements with similar characteristics
The lowest level in the classification hierarchy
represents what is commonly referred to as the "domain" of a
collection of data elements, one or more, with common
characteristics. For example, "text" related data elements
would be in one domain, "weights" in another, "percentages"
in another, "monetary values" in another, etc.
The domain also defines the standard physical
characteristics and values the data may assume. For example,
we could establish that all "location" values are alphanumeric,
left justified, with blank fill and void characters. In other
words, data elements such as "Address," "City," and "State"
should assume these physical characteristics for consistency.
If a data element does not have the standard logical and
physical characteristics, it must belong to another "domain."
In the situation where a data element may have only one logical
definition, but multiple physical definitions, its primary
physical definition must first conform to the Domain standards
before it can be deviated from in an application record. For
example, the primary physical representation of "Unit Cost"
is expressed as an eight character numeric to conform to the
"currency" domain. However, in one application, a user desires
the data element be expressed as a ten character numeric. It
is the same logical data element with just another form of
physical expression.
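The "Unit Cost" example can be sketched in Python as a domain standard plus an application-level deviation. The class name Physical and the right justification are assumptions made for illustration; only the eight and ten character lengths and the "currency" domain come from the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Physical:
    length: int
    data_class: str
    justification: str


# Hypothetical "currency" domain standard: eight-character numeric.
CURRENCY_DOMAIN = Physical(length=8, data_class="numeric", justification="right")

# The primary physical definition of "Unit Cost" conforms to its domain.
unit_cost_primary = CURRENCY_DOMAIN

# One application expresses the same logical element in ten characters.
unit_cost_in_application = replace(unit_cost_primary, length=10)

print(unit_cost_primary.length)                        # 8
print(unit_cost_in_application.length)                 # 10
print(unit_cost_in_application.data_class)             # numeric
print(unit_cost_primary == CURRENCY_DOMAIN)            # True
```

The deviation is derived from the primary definition, so the application record changes only the length and keeps every other domain characteristic.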
With a classification system in place, data elements can
then be uniquely and consistently defined. When this is done,
we then have a basis for checking data redundancy. Also, when
a data element has been properly specified in this manner, it
becomes rather simple to locate it again for use in other
applications.
Classifying data helps to fulfill one of the major
objectives of Data Resource Management: to eliminate
redundancy and promote the re-use of resources in applications.
SUMMARY OF MAJOR DBEM CONCEPTS
- DATA IS A RESOURCE THAT MUST BE MANAGED AND CONTROLLED LIKE
ANY OTHER RESOURCE. SYSTEMS COMMUNICATE THROUGH DATA.
- THE MISSION OF DATA RESOURCE MANAGEMENT IS TO STANDARDIZE AND
CONTROL DATA RESOURCES IN THE MOST COST-EFFECTIVE MEANS
POSSIBLE.
- BASIC CONSTRUCTS - Data Base, Files, Records, Data Elements
        LOGICAL                                      PHYSICAL
                           --------------
Represents an "object"     |    FILE    |     Represents how data is
of the business.           --------------     stored.
                                 |
                           --------------
Represents views of        |   RECORD   |     Represents an area
the "object."              --------------     within the File.
                                 |
Represents an individual   --------------     Represents an individual
element about an object.   |DATA ELEMENT|     element within a record.
                           --------------
There is not necessarily a one-to-one relationship between
logical files and physical files. However, the logical is used
to design the physical.
- THERE ARE TWO TYPES OF "OBJECTS": Facts and Events.
- FACTS are Name/Location oriented (tangible things).
- EVENTS are Date/Time oriented (intangible actions) and
represent some form of interaction between two or more
factual objects.
- AN OBJECT HAS ONE DATA ELEMENT USED TO UNIQUELY IDENTIFY
THE OVERALL OBJECT; THIS IS REFERRED TO AS THE "Primary
Basic Grouping."
- A FACTUAL OBJECT TYPICALLY RELATES TO ANOTHER FACTUAL OBJECT
THROUGH AN EVENT:
CUSTOMER ------------- ORDER ------------- PRODUCT
- THREE TYPES OF VIEWS WITHIN AN OBJECT:
- IDENTIFICATION VIEW - all objects will have one.
- CHARACTERISTIC VIEW - describes an object and is typically
associated with a factual object.
- RELATIONSHIP VIEW - establishes a relationship between
two or more objects. Will typically
apply to event related objects.
- ONLY PRIMARY DATA IS STORED IN A LOGICAL RECORD; GENERATED
DATA CAN BE DERIVED FROM PRIMARY DATA.
- BASIC GROUPING: The key to a logical record.
It is used...
- - As the principal criteria for combining logical records
into logical files (based on the primary basic grouping
data element).
- - As the principal criteria for establishing relationships
between logical records (based on secondary keys).
- - To give meaning to descriptive and quantitative data.
- THE ASSIGNMENT OF THE BASIC GROUPING INCLUDES:
- - The Primary Basic Grouping; a primary/indicative data
element used to identify an object in its entirety.
- - Secondary Key, to either:
- Establish a relationship to other objects (a Foreign Key);
object-oriented data elements are thereby used.
- To identify a specific view within an object (a qualifying
key to note a single occurrence of data). A view-identifier
data element is used.
- THERE ARE FOUR DATA BASE MODELS:
- ALDBM - Application Logical Data Base Model - all of the
primary data needed to satisfy the information
requirements of a system.
- ELDBM - Enterprise Logical Data Base Model - all of the
primary data needed to satisfy all of the applications (the
global view).
- EPDBM - Enterprise Physical Data Base Model - represents
how all corporate data is physically stored.
- APDBM - Application Physical Data Base Model - represents
how data for a single system is physically stored.
It also represents a subset of the EPDBM.
- DATA HAS ONE LOGICAL DEFINITION, BUT CAN HAVE MORE THAN ONE
PHYSICAL REPRESENTATION.
- TYPES OF DATA:
- INDICATIVE - to uniquely identify and control objects, in
part or in full. This will include data elements to either
identify a whole object or a single view.
- DESCRIPTIVE - to describe objects.
- QUANTITATIVE - numeric values used in calculations.
- FORMS OF DATA:
- PRIMARY - data assigned from a user area; outside a system.
- GENERATED - data derived from other data values; either from
calculations or grouping (concatenated data).
- ONLY PRIMARY/INDICATIVE DATA CAN BE USED IN THE BASIC
GROUPING OF A LOGICAL RECORD. Group data elements cannot.
- DATA TAXONOMY - a hierarchical structure used to classify
data elements. The intent is to eliminate redundancy.
- DOMAIN - the lowest level in the Data Taxonomy. A collection
of data elements exhibiting common characteristics.
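The role of the basic grouping in combining logical records into logical files can be sketched in Python. The record names below are hypothetical, and treating the first data element of each basic grouping as the primary basic grouping is an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical logical records: (record name, basic grouping).
RECORDS = [
    ("Customer Identification", ("Customer Number",)),
    ("Customer Contact", ("Customer Number", "Contact Number")),
    ("Order Identification", ("Order Number",)),
    ("Order Line", ("Order Number", "Product Number")),
]

logical_files = defaultdict(list)
for record_name, basic_grouping in RECORDS:
    # Assumption: the first element is the primary basic grouping,
    # which identifies the object the record belongs to.
    primary = basic_grouping[0]
    logical_files[primary].append(record_name)

for primary, records in logical_files.items():
    print(primary, "->", records)
```

Here "Contact Number" acts as a view identifier and "Product Number" as a secondary key relating the order to another object; neither changes the logical file in which the record is placed.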
DATA RESOURCE MANAGEMENT: THE FUNCTION
The scope of Data Resource Management is much more
encompassing than most people envision. It represents a large
investment by a company. This is partially the reason why few
companies have succeeded in this area; they simply fail to
comprehend the magnitude of the function and its importance.
Historically, one of the mistakes made when implementing Data
Resource Management has been delegating the task to
technicians. The "tool approach" to managing data
has emphasized the physical data base considerations but little
has been accomplished in regard to the logical side. This is
the principal reason why the proliferation of DBMS technology
has been so widespread. Another symptom of the problem is the
use of the term 'Data Base Administrator,' which clearly denotes a
technical task.
The management of data requires centralized control to
coordinate the use of resources. This is not to suggest a
centralized data base. A company could have a distributed data
base spread throughout various locations. It simply means the
function is best served through a focal point. This is no
different than the function of materials management which is
concerned with coordinating the use of all parts, regardless of
where used or how stored.
A centralized Data Resource Management function can better
promote and control the exchange of data between applications
than a decentralized function which would only have a partial
view of the corporate data base. Centralization would be able
to maintain the data base models more effectively than if they
were left to separate operating units. It would also be able to
enforce data base design standards on a more consistent basis.
One of the more controversial subjects in Data Resource
Management pertains to the "ownership" of data. There are
those who suggest that data "belongs" to the various users or
departments of a company. This is like saying money is the
property of the sales or accounting departments, not the
company. Data belongs to the enterprise as a whole and not to
any single person or department. Of course, how data is
accessed should be controlled and safeguarded, just as we would
do with any other resource. This is a responsibility for Data
Resource Management to perform.
The concepts of "end-user computing," "Data Mining" and the "Information
Center" are totally dependent on the integrity of the data base.
Without effective Data Resource Management, they would not be
possible. Users could access erroneous or unauthorized data
which could present serious financial and security problems to
a company.
RESPONSIBILITIES OF DATA RESOURCE MANAGEMENT
There are essentially seven responsibilities associated
with the Data Resource Management function:
1. To eliminate data definition redundancy.
This does not mean the elimination of duplicate data,
only the elimination of duplicate definitions. In many
instances, it may be more practical to physically store
redundant data in various locations. This decision is
based on what is effective for fulfilling the needs of
an application.
Data should be defined properly one time and then re-used
as often as there is an application need for it. This is
accomplished by classifying data resources and controlling
the four data base models.
2. To satisfy the data needs of all applications and
promote data sharing.
Here again, data base is defined as all of the data in
an organization required to produce information, regardless
of where used or how stored. In any given organization,
there is a finite number of primary data elements, not
infinite. The logical corporate data base will keep
expanding until this objective is reached. When this
finally occurs, the company is in an enviable position.
Users and the systems staff will be able to implement new
information requirements simply by adjusting timing and
creating new combinations of data.
3. To design the data base to be easy to maintain
and modify.
The four integrated data base models provide invaluable
assistance in this regard. They have the ability to expand
and contract as the business and applications change. They
also provide the means to scope and isolate problem areas
in the corporate data base.
Mapping the four models is best accomplished by using matrices
to express the extensive relationships.
4. To design the physical data base in the most efficient and
cost effective means possible.
Data Resource Management should be concerned with all file
management tools and techniques, not just one. Indexed
sequential files, sequential files, etc. are just as
important as any DBMS file. The function should be just
as concerned with the organization and filing techniques
of manual files as it is with computer files.
Unfortunately, this is not the situation in most companies
today. The tools and techniques selected should be based
on such things as required processing speed, anticipated
transaction volume, security, and other performance
considerations.
5. To design the data base to be independent from applications
and physical environments.
Data Resource Management must be careful not to impose a
dependency on particular hardware or software. In some
situations, this may not be avoidable. In this event,
conversion options should be explored and planned.
Obsolescence in the areas of storage devices and techniques
can militate against data base planning and create future
problems. Also, Data Resource Management should avoid any
short term solutions that may create long term problems.
This applies to designing a Data Base to meet some
particular application need without consideration for the
overall needs of the other applications. The Data Resource
Manager should have as objectives the complete independence
of the data base from hardware, system software and applications.
6. To cooperate with other IRM functions.
Data Resource Management is a function that does not operate
in isolation. It works closely with the other functions
associated with the disciplines of information systems
engineering and enterprise engineering. For example, the
application logical data base design developed by Systems
Engineering is checked for accuracy.
The specification of the logical designs will result in a
physical implementation requiring hardware and software
which may not exist or will require modification. Data
Resource Management must consider this carefully and advise
Systems Engineering accordingly. When determining project
costs, Project Management must be advised by Data Resource
Management of the pertinent data base costs. Also, the
application physical data base design must be delivered to
Software Engineering prior to programming.
For Enterprise Engineering, Data Resource Management must
create skeletal definitions of the objects required to
operate and manage the enterprise.
7. To control all data resources.
In order to provide an accurate accounting of all
resources, a "bill of materials" type of system is required
to catalog, classify, and cross-reference components to
where they are used. This is the intent of an Information
Resource Manager (IRM), a software tool used to inventory
and track the use of organizational resources, systems
resources, and data resources. This type of tracking could
be performed manually, but this would create a large and
cumbersome trail of paper.
Acting as a Bill of Material Processor (BOMP), an IRM
provides tremendous analytical capabilities, particularly in
the area of "impact analysis," which permits a user to
evaluate the effect of a change to a resource as it applies
to the other information resources in an enterprise. It can
also be used to maintain documentation on the various
resources. As changes are made, the documentation is
automatically updated. Because of the extensive resource
intelligence contained in the IRM, it can be used to drive
multiple physical data base management systems.
Because the IRM represents the central location for a
company's information knowledge, it becomes a tool used in
enterprise engineering and information systems engineering,
as well as data base engineering.
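The "bill of materials" style of cross-referencing and the resulting "impact analysis" can be sketched in Python. This is not a description of any actual IRM product: the where-used records and resource names are invented, and impact_of_change is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical where-used records: (component, resource that uses it).
WHERE_USED = [
    ("Customer Number", "Customer Identification Record"),
    ("Customer Number", "Order Record"),
    ("Customer Identification Record", "Customer File"),
    ("Order Record", "Order File"),
    ("Order File", "Order Entry System"),
]

used_by = defaultdict(set)
for component, parent in WHERE_USED:
    used_by[component].add(parent)


def impact_of_change(component):
    """Return every resource directly or indirectly affected by a change."""
    affected, pending = set(), [component]
    while pending:
        current = pending.pop()
        for parent in used_by.get(current, ()):
            if parent not in affected:
                affected.add(parent)
                pending.append(parent)
    return affected


print(sorted(impact_of_change("Customer Number")))
print(sorted(impact_of_change("Order File")))   # ['Order Entry System']
```

A change to one data element is traced upward through every record, file, and system that uses it, which is the same explosion logic a bill of material processor applies to parts.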
Selling the concept of Data Resource Management to
corporate executives is a relatively simple task, as long as
it is communicated in management terms, not technical jargon.
Data Resource Management will not be successful as long as
executives view it as a technical function. However, if
management understands and accepts the fact that Data Resource
Management is simply another form of materials management, a
vital part of the "Information Factory" concept, they will
understand and support an effective Data Resource Management
function.
METHODOLOGY CONSTRUCTION/NAVIGATION
The Data Base Engineering Methodology (DBEM) consists of an
assembly of six phases detailing what is to be accomplished and
by whom. Each phase consists of a defined set of activities (a
total of 24); each activity consists of a series of operations
or tasks to be performed. All phases, activities and tasks
produce tangible deliverables that can be reviewed and checked.
These deliverables substantiate adherence to the methodology and
permit the measurement of progress. Both formal and informal
review points are contained throughout the methodology, which
provide for effective dialog between management and data
base engineers.
DBEM emphasizes design correctness and the production of a
quality product. The first phase is essentially used to plan
the DBEM project. The intermediate phases map the four data base
models mentioned earlier. The final phase (6) is used to
evaluate the DBEM project.