Chapter 3 Database Systems, Data Warehouses, and Data Marts Learning Objectives Define a database and a database management system. Explain logical database design and the relational database model. Define the components of a database management system. Summarize recent trends in database design and use. Explain the components and functions of a data warehouse. Describe the functions of a data mart. Define business analytics and describe its role in the decision-making process. Explain big data and its business applications. Detailed Chapter Outline I. Databases A database is a collection of related data that is stored in a central location or in multiple locations. In a database, a file is a group of related records, and a record is a group of related fields. This structure is called a data hierarchy. In a database system, all files are integrated, meaning information can be linked. A database is a critical component of information systems because any type of analysis that is done is based on data available in the database. To make using databases more efficient, a database management system (DBMS) is used. A DBMS is software for creating, storing, maintaining, and accessing database files. In the past, data was stored in a series of files that were called “flat files” because they were not arranged in a hierarchy, and there was no relationship among these files. The problem with this flat file organization was that the same data could be stored in more than one file, creating data redundancy. Some of the advantages of a database over a flat file system are as follows: More information can be generated from the same data. Complex requests can be handled more easily. Data redundancy is eliminated or minimized. Data management is improved. Storage space is reduced. A. Types of Data in a Database To generate business intelligence (BI), the database component of an information system needs access to two types of data: internal and external. Internal data is collected from within an organization and can include transaction records, sales records, personnel records, and so forth. Internal data is usually stored in the organization’s internal databases and can be used by functional information systems. External data comes from a variety of sources and is often stored in a data warehouse. Competitors, customers, suppliers, distribution networks, and tax records are some examples of sources for external data. B. Methods for Accessing Files In a database, files are accessed by using a sequential, random, or indexed sequential method. In a sequential access file structure, records in files are organized and processed in numerical or sequential order, typically the order in which they were entered. Records are organized based on what is known as a “primary key.” In a random access file structure, records can be accessed in any order, regardless of their physical locations in storage media. This method of access is fast and very effective when a small number of records needs to be processed daily or weekly. To achieve this speed, these records are often stored on magnetic disks. Disks are random access devices, whereas tapes are sequential access devices. With the indexed sequential access method (ISAM), records can be accessed sequentially or randomly, depending on the number being accessed. ISAM, as the name suggests, uses an index structure and has two parts: the indexed value and a pointer to the disk location of the record matching the indexed value. Retrieving a record requires at least two disk accesses, once for the index structure and once for the actual record. Because every record needs to be indexed, if the file contains a huge number of records, the index is also quite large. Therefore, an index is more useful when the number of records is small. Access speed with this method is fast, so it is recommended when records must be accessed frequently. II. Logical Database Design Before designing a database, one needs to know the two ways information is viewed in a database. The physical view involves how data is stored on and retrieved from storage media, such as hard disks, magnetic tapes, or CDs. For each database, there is only one physical view of data. The logical view involves how information appears to users and how it can be organized and retrieved. The first step in database design is defining a data model, which determines how data is created, represented, organized, and maintained. A data model usually includes the following three components: Data structure—describes how data is organized and the relationship among records Operations—describes methods, calculations, and so forth that can be performed on data, such as updating and querying data Integrity rules—defines the boundaries of a database, such as maximum and minimum values allowed for a field, constraints (limits on what type of data can be stored in a field), and access methods Many data models are used. In a hierarchical model, the relationships among records form a treelike structure (hierarchy). Records are called nodes, and relationships among records are called branches. The node at the top is called the root, and every other node (called a child) has a parent. Nodes with the same parents are called twins or siblings. The network model is similar to the hierarchical model, but records are organized differently. This model links invoice number, customer number, and method of payment. Unlike the hierarchical model, each record in the network model can have multiple parent and child records. A. The Relational Model A relational model uses a two-dimensional table of rows and columns of data. Rows are records (also called tuples), and columns are fields (also referred to as attributes). To begin designing a relational database, one must define the logical structure by defining each table and the fields in it. The collection of these definitions is stored in the data dictionary. The data dictionary can also store other definitions, such as data types for fields, default values for fields, and validation rules for data in each field. In a relational database, every record must be uniquely identified by a primary key. Student ID numbers, Social Security numbers, account numbers, and invoice numbers are examples of primary keys. To establish relationships among tables so data can be linked and retrieved more efficiently, a primary key for one table can appear in other tables. In this case, it is called a foreign key. To improve database efficiency, a process called normalization is used, which eliminates redundant data and ensures that only related data is stored in a table. Normalization can go through several stages, from first normal form (1NF) to fifth normal form (5NF). Typically, however, only stages 1NF through 3NF are used. Data stored in a relational model is retrieved from tables by using operations that pick and combine data from one or more tables. There are several operations, such as select, project, join, intersect, union, and difference. A select operation searches data in a table and retrieves records based on certain criteria (also called conditions). A project operation pares down a table by eliminating columns (fields) according to certain criteria. A join operation combines two tables based on a common field (e.g., the primary key in the first table and the foreign key in the second table). III. Components of a DBMS A. Database Engine A database engine, the heart of DBMS software, is responsible for data storage, manipulation, and retrieval. It converts logical requests from users into their physical equivalents (e.g., reports) by interacting with other components of the DBMS (usually the data manipulation component). Because more than one logical view of data is possible, the database engine can retrieve and return data to users in many different ways. B. Data Definition The data definition component is used to create and maintain the data dictionary and define the structure of files in a database. Any changes to a database’s structure, such as adding a field, deleting a field, changing a field’s size, or changing the data type stored in a field, are made with this component. C. Data Manipulation The data manipulation component is used to add, delete, modify, and retrieve records from a database. Typically, a query language is used for this component. Many query languages are available, but Structured Query Language (SQL) and query by example (QBE) are two of the most widely used. Structured Query Language (SQL) is a standard fourth-generation query language used by many DBMS packages, such as Oracle Database 11g and Microsoft SQL Server. SQL consists of several keywords specifying actions to take. The basic format of an SQL query is as follows: SELECT field FROM table or file WHERE conditions With query by example (QBE), data is requested from a database by constructing a statement made up of query forms. With current graphical databases, users simply click to select query forms instead of having to remember keywords, as they would do with SQL. Users can add AND, OR, and NOT operators to the QBE form to fine-tune the query. AND—means that all conditions must be met. OR—means only one of the conditions must be met. NOT—searches for records that do not meet the condition. D. Application Generation The application generation component is used to design elements of an application using a database, such as data entry screens, interactive menus, and interfaces with other programming languages. Typically, IT professionals and database administrators use this component. E. Data Administration The data administration component, also used by IT professionals and database administrators, is used for tasks such as backup and recovery, security, and change management. In addition, this component is used to determine who has permission to perform certain functions, often summarized as create, read, update, and delete (CRUD). In large organizations, database design and management is handled by the database administrator (DBA), although with complex databases this task is sometimes handled by an entire department. The DBA’s responsibilities include: Designing and setting up a database Establishing security measures to determine users’ access rights Developing recovery procedures in case data is lost or corrupted Evaluating database performance Adding and fine-tuning database functions IV. Recent Trends in Database Design and Use Recent trends in database design and use include data-driven Web sites, natural language processing, distributed databases, and object-oriented databases. In addition to these trends, advances in artificial intelligence and natural language processing will have an impact on database design and use, such as improving user interfaces. A. Data-Driven Web Sites With the popularity of e-commerce applications, data-driven Web sites are used more widely to provide dynamic content. A data-driven Web site acts as an interface to a database, retrieving data for users and allowing users to enter data in the database. Without this feature, Web site designers must edit the HTML code every time a Web site’s data contents change. This type of site is called a “static” Web site. A data-driven Web site, on the other hand, changes automatically because it retrieves content from external dynamic data sources, such as MySQL, Microsoft SQL Server, Microsoft Access, Oracle, IBM DB2, and other databases. A data-driven Web site improves access to information so users’ experiences are more interactive, and it reduces the support and overhead needed to maintain static Web sites. Data-driven Web sites are useful for the following applications, among others: E-commerce sites that need frequent updates News sites that need regular updating of content Forums and discussion groups Subscription services, such as newsletters B. Distributed Databases The database types discussed so far use a central database for all users of an information system. However, in some situations, a distributed database management system (DDBMS), in which data is stored on multiple servers placed throughout an organization, is preferable. Some of the reasons an organization would choose a distributed database are as follows: Local storage of data decreases response time but increases communication costs. Several small integrated systems might cost less than one large server. Most importantly, a distributed database is not limited by the data’s physical location. There are three approaches to setting up a DDBMS, although these approaches can be combined: Fragmentation—the fragmentation approach addresses how tables are divided among multiple locations. Horizontal fragmentation breaks a table into rows, storing all fields (columns) in different locations. Vertical fragmentation stores a subset of columns in different locations. Mixed fragmentation, which combines vertical and horizontal fragmentation, stores only site-specific data in each location. Replication—with the replication approach, each site stores a copy of data in the organization’s database. Allocation—the allocation approach combines fragmentation and replication. Security issues are more challenging in a distributed database because of multiple access points from both inside and outside the organization. Security policies, scope of user access, and user privileges must be clearly defined, and authorized users must be identified. C. Object-Oriented Databases The relational model discussed previously is designed to handle homogenous data organized in a field-and-record format. Representing more complex data relationships sometimes is not possible with a relational database. To address these problems, object-oriented databases were developed. Like object-oriented programming, this data model represents real-world entities with database objects. An object consists of attributes (characteristics describing an entity) and methods (operations or calculations) that can be performed on the object’s data. Grouping objects along with their attributes and methods into a class is called encapsulation, which essentially means grouping related items into a single unit. Encapsulation helps handle more complex types of data, such as images and graphs. Object-oriented databases can also use inheritance, which means new objects can be created faster and more easily by entering new data in attributes. This data model expands on the relational model by supporting more complex data management, so modeling real-world problems is easier. In contrast to the query languages used to interact with a relational database, interaction with an object-oriented database takes places via methods, which are called by sending a message to an object. V. Data Warehouses A data warehouse is a collection of data from a variety of sources used to support decision-making applications and generate business intelligence. Data warehouses store multidimensional data, so they are sometimes called “hypercubes.” Typically, data in a data warehouse is described as having the following characteristics in contrast to data in a database: Subject oriented—focused on a specific area, such as the home-improvement business or a university, whereas data in a database is transaction/function oriented Integrated—comes from a variety of sources, whereas data in a database usually does not Time variant—categorized based on time, such as historical information, whereas data in a database only keeps recent activity in memory Type of data—captures aggregated data, whereas data in a database captures raw transaction data Purpose—used for analytical purposes, whereas data in a database is used for capturing and managing transactions Designing and implementing a data warehouse is a complex task, but specific software is available to help. Oracle, IBM, Microsoft, Teradata, SAS, and Hewlett-Packard are market leaders in data-warehousing platforms. Exhibit 3.6 shows a data warehouse configuration with four major components: input; extraction, transformation, and loading (ETL); storage; and output. A. Input Data can come from a variety of sources, including external data sources, databases, transaction files, enterprise resource planning (ERP) systems, and customer relationship management (CRM) systems. ERP systems collect, integrate, and process data that can be used by all functional areas in an organization. CRM systems collect and process customer data to provide information for improving customer service. Together, these data sources provide the input a data warehouse needs to perform analyses and generate reports. B. ETL Extraction, transformation, and loading (ETL) refers to the processes used in a data warehouse. Extraction means collecting data from a variety of sources and converting it into a format that can be used in transformation processing. Transformation processing is done to make sure data meets the data warehouse’s needs. Its tasks include the following: Selecting only certain columns or rows to load Translating coded values, such as replacing Yes with 1 and No with 2 Performing select, project, and join operations on data Sorting and filtering data Aggregating and summarizing data before loading it in the data warehouse Loading is the process of transferring data to the data warehouse. Depending on the organization’s needs and the data warehouse’s storage capacity, loading might overwrite existing data or add collected data to existing data. C. Storage Collected information is organized in a data warehouse as raw data, summary data, or metadata. Raw data is information in its original form. Summary data gives users subtotals of various categories, which can be useful. Metadata is information about data—its content, quality, condition, origin, and other characteristics. Metadata tells users how, when, and by whom data was collected and how data has been formatted and converted into its present form. D. Output The databases discussed so far support online transaction processing (OLTP) to generate reports such as the following: Which product generated the highest sales last month? Which region generated the lowest sales last month? Which salespersons increased sales by more than 30 percent last quarter? Data warehouses, however, use online analytical processing and data-mining analysis to generate reports. Online Analytical Processing Online analytical processing (OLAP), unlike OLTP, is used to quickly answer multidimensional analytical queries, thereby generating business intelligence. It uses multiple sources of information and provides multidimensional analysis, such as viewing data based on time, product, and location. Sometimes, this analysis is called slicing and dicing data. OLAP allows one to analyze information that has been summarized in multidimensional views. OLAP tools are used to perform trend analysis and sift through massive amounts of statistics to find specific information. These tools usually have a “drill-down” feature for accessing multilayer information. Data-Mining Analysis Data-mining analysis is used to discover patterns and relationships. The following are typical questions one can answer by using data-mining tools: Which customers are likely to respond to a new product? Which customers are likely to respond to a new ad campaign? What product should be recommended to this customer based on his or her past buying patterns? Vendors of data-mining software include SAP Business Objects (www.sap.com), SAS (www.sas.com), Cognos (http://cognos.com), and Informatica (www.informatica.com). Decision-Making Reports A data warehouse can generate all types of information as well as reports used for decision making. The following are examples of what a data warehouse can allow one to do: Analyze large amounts of historical data quickly. Assist management in making well-informed business decisions. Manage a high demand for information from many users with different needs and decision-making styles. VI. Data Marts A data mart is usually a smaller version of a data warehouse, used by a single department or function. Data marts focus on business functions for a specific user group in an organization, such as a data mart for the Marketing Department. Data marts have the following advantages over data warehouses: Access to data is often faster because of their smaller size. Response time for users is improved. They are easier to create because they are smaller and often less complex. They are less expensive. Users are targeted better, because a data mart is designed for a specific department or division. Data marts, however, usually have more limited scope than data warehouses, and consolidating information from different departments or functional areas (such as sales and production) is more difficult. VII. Business Analytics Business analytics (BA) uses data and statistical methods to gain insight into the data and provide decision makers with information they can act on. Compared to business intelligence (BI), BA is more forward looking; it tells the user what is going to happen in the future rather than what has happened in the past. BI can help determine what happened, what the problem is, and what decisions need to be made based on the available data. BA can help determine why this is happening, what it will mean if the sales trend continues, and what actions should be taken. BI uses dashboards, scorecards, OLAP, and query reports to support decision-making activities, whereas BA uses statistical analysis, data mining tools, and predictive modeling. There are several types of BA methods. Three popular ones are descriptive, predictive, and prescriptive analytics. Descriptive analytics reviews past events, analyzes the data, and provides a report indicating what happened in a given period and how to prepare for the future. Thus, it is a reactive strategy. Predictive analytics, as the name indicates, is a proactive strategy; it prepares a decision maker for future events. Prescriptive analytics goes beyond descriptive and predictive analytics by recommending a course of action that a decision maker should follow and showing the likely outcome of each decision. Within the general domain of analytics two other terms are being commonly used: Web analytics and mobile analytics. Web analytics collects and analyzes the Web data in order to find out the efficiency and effectiveness of a Web site. On the other hand, mobile analytics is concerned with mobile Web and mobile apps. It assesses and measures traffic among mobile devices and all the apps used by these mobile devices. The value that BA offers organizations is huge. Organizations that want to take full advantage of it, however, will need a chief analytics officer (CAO). This individual should sift through the enterprise-wide data in order to generate BI. Major providers of predictive analytics software include SAS, IBM, SAP, Microsoft, and Oracle. VIII. The Big Data Era There has been an explosion in data growth throughout the business world in recent years. In fact, it has been estimated that the volume of business data worldwide doubles every 1.2 years. Big data is data so voluminous that conventional computing methods are not able to efficiently process and manage it. There are three dimensions to big data, known as the 3 Vs: volume, variety, and velocity. Volume—this refers to the sheer quantity of transactions, measured in petabytes or exabytes. Variety—this refers to the combination of structured data and unstructured data. Velocity—this refers to the speed with which the data has to be gathered and processed. A. Who Benefits from Big Data? Many industries could benefit from big data analytics and gain a competitive advantage in the following areas: retail, financial services, advertising and public relations, government, manufacturing, media and telecommunications, energy, and healthcare. B. Tools and Technologies of Big Data Many technologies and applications have contributed to the growth and popularity of big data. Mobile and wireless technology, the popularity of social networks, and the enhanced power and sophistication of smartphones and handheld devices are among the key factors. The most commonly used platform for big data analytics is the open-source Apache Hadoop, which uses the Hadoop Distributed File System (HDFS) to manage storage. Distributed databases, including NoSQL and Cassandra, are also commonly associated with big data projects. C. Big Data Privacy Risks Big data analytics is able to provide key decision makers with unprecedented decision-making support and great potential for gaining a competitive advantage. However, this powerful tool could reveal and expose certain information that puts some people’s privacy at risk. Some of the risks associated with this technology are: discrimination, privacy breaches and embarrassments, unethical actions based on interpretations, loss of anonymity, and only few legal protections exist for the involved individuals. Key Terms A database is a collection of related data that is stored in a central location or in multiple locations. (P.47) A data hierarchy is the structure and organization of data, which involves fields, records, and files. (P.47) A database management system (DBMS) is software for creating, storing, maintaining, and accessing database files. A DBMS makes using databases more efficient. (P.47) In a sequential access file structure, records in files are organized and processed in numerical or sequential order, typically the order in which they were entered. (P.50) In a random access file structure, records can be accessed in any order, regardless of their physical locations in storage media. This method of access is fast and very effective when a small number of records need to be processed daily or weekly. (P.50) With the indexed sequential access method (ISAM), records can be accessed sequentially or randomly, depending on the number being accessed. For a small number, random access is used, and for a large number, sequential access is used. (P.50) The physical view involves how data is stored on and retrieved from storage media, such as hard disks, magnetic tapes, or CDs. (P.50) The logical view involves how information appears to users and how it can be organized and retrieved. (P.50) A data model determines how data is created, represented, organized, and maintained. It usually contains data structure, operations, and integrity rules. (P.50) In a hierarchical model, the relationships between records form a tree-like structure (hierarchy). Records are called nodes, and relationships between records are called branches. The node at the top is called the root, and every other node (called a child) has a parent. Nodes with the same parents are called twins or siblings. (P.51) The network model is similar to the hierarchical model, but records are organized differently. Unlike the hierarchical model, each record in the network model can have multiple parent and child records. (P.51) A relational model uses a two-dimensional table of rows and columns of data. Rows are records (also called tuples), and columns are fields (also referred to as attributes). (P.51) The data dictionary stores definitions, such as data types for fields, default values, and validation rules for data in each field. (P.52) A primary key uniquely identifies every record in a relational database. Examples include student ID numbers, account numbers, Social Security numbers, and invoice numbers. (P.52) A foreign key is a field in a relational table that matches the primary key column of another table. It can be used to cross-reference tables. (P.52) Normalization improves database efficiency by eliminating redundant data and ensures that only related data is stored in a table. (P.52) Structured Query Language (SQL) is a standard fourth-generation query language used by many DBMS packages, such as Oracle 11g and Microsoft SQL Server. SQL consists of several keywords specifying actions to take. (P.54) With query by example (QBE), you request data from a database by constructing a statement made up of query forms. With current graphical databases, you simply click to select query forms instead of having to remember keywords, as you do with SQL. You can add AND, OR, and NOT operators to the QBE form to fine-tune the query. (P.54) Create, read, update, and delete (CRUD) refers to the range of functions that data administrators determine who has permission to perform certain functions. (P.55) Database administrators (DBA), found in large organizations, design and set up databases, establish security measures, develop recovery procedures, evaluate database performance, and add and fine-tune database functions. (P.55) A data-driven Web site acts as an interface to a database, retrieving data for users and allowing users to enter data in the database. (P.56) A distributed database management system (DDBMS) stores data on multiple servers throughout an organization. (P.56) The fragmentation approach to a distributed DBMS addresses how tables are divided among multiple locations. There are three variations: horizontal, vertical, and mixed. (P.56) The replication approach to a distributed DBMS has each site store a copy of the data in the organization’s database. (P.56) The allocation approach to a distributed DBMS combines fragmentation and replication, with each site storing the data it uses most often. (P.57) In object-oriented databases, both data and their relationships are contained in a single object. An object consists of attributes and methods that can be performed on the object’s data. (P.57) Encapsulation refers to the grouping into a class of various objects along with their attributes and methods—meaning, grouping related items into a single unit. This helps handle more complex types of data, such as images and graphs. (P.57) Inheritance refers to new objects being created faster and more easily by entering new data in attributes. (P.57) A data warehouse is a collection of data from a variety of sources used to support decision-making applications and generate business intelligence. (P.58) Extraction, transformation, and loading (ETL) refers to the processes used in a data warehouse. It includes extracting data from outside sources, transforming it to fit operational needs, and loading it into the end target (database or data warehouse). (P.59) Online transaction processing (OLTP) is used to facilitate and manage transaction-oriented applications, such as point-of-sale, data entry, and retrieval transaction processing. It generally uses internal data and responds in real time. (P.60) Online analytical processing (OLAP) generates business intelligence. It uses multiple sources of information and provides multidimensional analysis, such as viewing data based on time, product, and location. (P.60) Data-mining analysis is used to discover patterns and relationships. (P.61) A data mart is usually a smaller version of a data warehouse, used by a single department or function. (P.62) Business analytics (BA) uses data and statistical methods to gain insight into the data and provide decision makers with information they can act on. (P.62) Big data is data so voluminous that conventional computing methods are not able to efficiently process and manage it. (P.64) Instructor Manual for MIS Hossein Bidgoli 9781305632004, 9781337625999, 9781337625982, 9781337406925
Close