Efficient Graph Storage for Entity Resolution Using Clique-Based Compression


One of the core challenges in entity resolution (ER) is managing and maintaining the complex relationships between records. Tilores models each entity as a graph: each node represents a record, and each edge represents a rule match between two records. This approach gives us flexibility, traceability and high accuracy, but it also presents significant storage and computational challenges, especially at scale. This article explains how to store highly connected graphs efficiently using clique-based graph compression.

The Entity Graph Model

In Tilores, a valid entity is a graph in which each record is connected to at least one other record by a matching rule. For example, if record a matches record b according to rule R1, we store this as the edge "a:b:R1". If another rule, R2, also connects a and b, we store an additional edge "a:b:R2". These edges are kept as a simple list, but they could alternatively be modeled with an adjacency list structure for more efficient storage.
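
As a minimal sketch of the adjacency list idea, the snippet below parses edge strings of the form "a:b:R1" into a nested mapping that keeps the rule IDs per neighbor. The function name build_adjacency and the dictionary layout are assumptions made for illustration, not the actual Tilores storage format, and edges are assumed to be undirected.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Turn "a:b:R1"-style edge strings into {node: {neighbor: {rule, ...}}}."""
    adjacency = defaultdict(lambda: defaultdict(set))
    for edge in edges:
        # Assumption: record IDs and rule IDs contain no ":" themselves.
        source, target, rule = edge.split(":")
        # Assumption: a rule match is symmetric, so store both directions.
        adjacency[source][target].add(rule)
        adjacency[target][source].add(rule)
    return adjacency


adjacency = build_adjacency(["a:b:R1", "a:b:R2", "a:c:R1"])
print(sorted(adjacency["a"]["b"]))  # the rules that link a and b
```

With this layout, several rules matching the same pair of records share one neighbor entry instead of repeating both record IDs for every rule.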

Why keep all edges?

Most entity resolution or master data management systems do not retain the relationships between records. They store only a representation of the underlying data, typically together with a generic matching score, leaving the user unsure how the entity was formed. Worse, users cannot correct mistakes made by the automatic matching system.

Keeping all edges of an entity graph therefore serves several purposes:

Traceability: Allows the user to understand why two records are grouped into the same entity.
Analytics: Insights such as rule effectiveness and data similarity can be extracted from edge metadata.
Data deletion and recalculation: When records are deleted or rules are modified, the graph must be recalculated. Edge information is essential for knowing how an entity was formed and how it should be updated.

The Scaling Problem: Quadratic Growth

Discussions of scaling issues in entity resolution usually refer to the challenge of matching each record against all other records. While that is a challenge of its own, retaining all edges of an entity leads to a similar problem on the storage side. Entities with many interconnected records create many edges. In the worst case, every new record is connected to all existing records. This quadratic growth can be expressed with the formula:

n * (n - 1) / 2

For small entities, this is not a problem. For example, an entity with 3 records can have up to 3 edges. For n = 100, this increases to 4,950 edges, and for n = 1,000, this will result in up to 499,500 edges.
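
The figures above follow directly from the formula and can be reproduced with a few lines of Python. The helper name max_edges is purely illustrative.

```python
def max_edges(n):
    """Maximum number of edges in an entity with n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (3, 100, 1000):
    print(n, max_edges(n))
```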

This creates a huge storage and computational overhead, especially since entity resolution graphs often exhibit this kind of dense connectivity.

Solution: Clique-Based Graph Compression (CBGC)

A clique in a graph is a group of nodes in which every node is connected to every other node in the group. A clique is also called a complete subgraph. The smallest possible clique contains a single node and no edges. A pair of nodes connected by an edge also forms a clique, and three nodes, such as the ones below, form a triangle-shaped clique.

A simple clique: triangle
(Image by the author)

A maximal clique is a clique that cannot be extended by adding any adjacent node, while the maximum clique is the clique with the largest number of nodes in the entire graph. For the purposes of this article, we use the term clique to refer only to cliques with at least three nodes.

The triangle shown above can be represented in Tilores by the following edges:

[
  "a:b:R1",
  "a:c:R1",
  "b:c:R1"
]

Because a triangle is a clique, we can also represent the graph by storing only the nodes of that clique and the associated rule ID:

{
  "R1": [
    ["a", "b", "c"]
  ]
}

Let’s consider the following, more complex graph:

Complete subgraph with 6 nodes
(Image by the author)

Judging by its appearance, we can easily see that all nodes are connected to each other. So instead of listing all 15 edges [remember n*(n-1)/2], we can simply store this clique in the following form:

{
  "R1":[
    ["a", "b", "c", "d", "e", "f"]
  ]
}
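
To show that nothing is lost by storing only the node list, the sketch below rebuilds the edge strings from the compressed form. The function name expand_cliques is an assumption for illustration, and the edges are emitted in the order of the node list.

```python
from itertools import combinations


def expand_cliques(cliques_by_rule):
    """Rebuild "a:b:R1"-style edges from the form {rule: [[nodes], ...]}."""
    edges = []
    for rule, cliques in cliques_by_rule.items():
        for clique in cliques:
            # Every pair of nodes inside a clique is connected.
            for source, target in combinations(clique, 2):
                edges.append(f"{source}:{target}:{rule}")
    return edges


triangle = expand_cliques({"R1": [["a", "b", "c"]]})
six_nodes = expand_cliques({"R1": [["a", "b", "c", "d", "e", "f"]]})
print(triangle)
print(len(six_nodes))
```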

However, in real graphs, not all records are connected to each other. Consider the following graph:

A complex graph with three highlighted cliques
(Image by the author)

There are three larger cliques highlighted: yellow, red and blue (if you are picky). There is also one remaining node. While these are probably the largest cliques, you may find dozens of others. For example, did you notice the 4-node clique formed by two red and two yellow nodes?

Sticking with the colored cliques, we can store them as follows (using y, r and b for yellow, red and blue):

{
  "R1": [
    ["y1", "y2", "y3"],
    ["r1", "r2", "r3", "r4", "r5"],
    ["b1", "b2", "b3", "b4", "b5", "b6"]
  ]
}

Additionally, we store the remaining 10 edges (p for purple):

[
  "y1:r1:R1",
  "y1:r2:R1",
  "y2:r1:R1",
  "y2:r2:R1",
  "r4:p1:R1",
  "r5:p1:R1",
  "r5:b1:R1",
  "b2:p1:R1",
  "y3:b5:R1",
  "y3:b6:R1"
]

This means that the entire graph can now be represented with only three cliques and ten edges, rather than the original 38 edges.
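
The figure of 38 can be checked from the stored data alone: each clique of k nodes stands for k*(k-1)/2 edges, and the sparse edges are added on top. The sketch below assumes that the cliques share no edges with each other or with the sparse list, which holds for this example. The name represented_edges is illustrative.

```python
cliques = {
    "R1": [
        ["y1", "y2", "y3"],
        ["r1", "r2", "r3", "r4", "r5"],
        ["b1", "b2", "b3", "b4", "b5", "b6"],
    ]
}
sparse_edges = [
    "y1:r1:R1", "y1:r2:R1", "y2:r1:R1", "y2:r2:R1", "r4:p1:R1",
    "r5:p1:R1", "r5:b1:R1", "b2:p1:R1", "y3:b5:R1", "y3:b6:R1",
]


def represented_edges(cliques_by_rule, sparse_edges):
    """Count the edges that the compressed form stands for."""
    clique_edges = sum(
        len(clique) * (len(clique) - 1) // 2
        for clique_list in cliques_by_rule.values()
        for clique in clique_list
    )
    return clique_edges + len(sparse_edges)


stored_items = sum(len(c) for c in cliques.values()) + len(sparse_edges)
print(stored_items, represented_edges(cliques, sparse_edges))
```

The yellow, red and blue cliques contribute 3, 10 and 15 edges respectively, which together with the 10 sparse edges gives the original total.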

This clique-based graph compression (CBGC) is lossless (unless you need edge properties). In realistic datasets, we observed massive storage savings. For one customer, CBGC reduced edge storage by 99.7%, replacing hundreds of thousands of edges with just a few hundred cliques and sparse edges.

Performance benefits beyond storage

CBGC is more than just compression. It also enables faster operations, especially when handling record and edge deletions.

Any sane entity resolution engine should split an entity into multiple entities if the only link between two subgraphs is deleted, for example for regulatory or compliance reasons. Connected component algorithms are typically used to identify the separate, unconnected subgraphs. In short, such an algorithm groups all nodes that are connected through edges into separate subgraphs. As a result, each edge needs to be checked at least once.

However, if the graph is stored in compressed form, there is no need to traverse all edges of a clique. Instead, it is sufficient to add a limited number of edges per clique, such as a single path through the clique's nodes, treating each clique as a pre-connected subgraph.
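
One way to sketch this is a union-find pass in which each clique of k nodes contributes only a chain of k - 1 links instead of all k*(k-1)/2 edges. The function name connected_components and the choice of union-find are assumptions for illustration. The article does not state which connected component algorithm Tilores uses.

```python
def connected_components(cliques_by_rule, sparse_edges):
    """Group nodes into components without expanding the cliques."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for clique_list in cliques_by_rule.values():
        for clique in clique_list:
            for node in clique:
                find(node)  # register every node, even in tiny cliques
            # A path through the clique (k - 1 links) is enough to connect it.
            for a, b in zip(clique, clique[1:]):
                union(a, b)
    for edge in sparse_edges:
        source, target, _rule = edge.split(":")
        union(source, target)

    components = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)
    return list(components.values())


cliques = {"R1": [["a", "b", "c"], ["d", "e", "f"]]}
joined = connected_components(cliques, ["c:d:R1"])
split = connected_components(cliques, [])  # the only link c:d was deleted
print(len(joined), len(split))
```

In the usage example, deleting the single sparse edge between the two cliques splits one entity into two, and neither run touches the six edges inside the cliques individually.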

Trade-off: Clique detection complexity

There is a trade-off: clique detection is computationally expensive, especially when trying to find the maximum clique, which is a well-known NP-hard problem.

In practice, it is usually sufficient to simplify this workload. Approximate clique detection algorithms (e.g., greedy heuristics) perform well enough for most purposes. Furthermore, CBGC is recalculated selectively, usually when the number of edges of an entity exceeds a threshold. This hybrid approach balances compression efficiency against acceptable processing cost.
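
As an illustration of what such a greedy heuristic could look like, the sketch below grows a single clique by visiting nodes in order of descending degree and keeping each node that is connected to every member collected so far. This is one possible heuristic chosen for illustration, not necessarily the one Tilores uses. The result is a maximal clique but not guaranteed to be the maximum one, and the name greedy_clique and the plain {node: set of neighbors} input are assumptions.

```python
def greedy_clique(adjacency):
    """Grow one clique greedily from a {node: {neighbors}} mapping."""
    # Visit nodes by descending degree, breaking ties by name for determinism.
    order = sorted(adjacency, key=lambda node: (-len(adjacency[node]), node))
    clique = []
    for node in order:
        # Keep the node only if it is connected to every current member.
        if all(node in adjacency[member] for member in clique):
            clique.append(node)
    return clique


# A triangle a-b-c with one extra node d attached to c.
adjacency = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(sorted(greedy_clique(adjacency)))
```

Repeating the search on the nodes that remain would yield further cliques, and whatever is left over can be kept as sparse edges.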

Beyond cliques

Arguably, the most common pattern in entity resolution is the complete subgraph. However, further optimization could be achieved by identifying other recurring patterns, such as:

Stars: Store as a list of nodes, where the first entry represents the central node
Paths: Store as an ordered list of nodes
Communities: Store like a clique and mark the missing edges
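
The article does not specify a storage format for these patterns, so the following is only a hypothetical sketch of how each one could be expanded back into node pairs. All three function names and the representation of missing edges as a list of pairs are assumptions.

```python
from itertools import combinations


def expand_star(nodes):
    """First entry is the center; every other node links to it."""
    center, *leaves = nodes
    return [(center, leaf) for leaf in leaves]


def expand_path(nodes):
    """Consecutive entries of the ordered list are linked."""
    return list(zip(nodes, nodes[1:]))


def expand_community(nodes, missing):
    """Like a clique, minus the explicitly listed missing pairs."""
    missing_pairs = {frozenset(pair) for pair in missing}
    return [
        pair for pair in combinations(nodes, 2)
        if frozenset(pair) not in missing_pairs
    ]


print(expand_star(["a", "b", "c", "d"]))
print(expand_path(["a", "b", "c"]))
print(expand_community(["a", "b", "c", "d"], [("d", "a")]))
```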

Closing thoughts

Entity resolution systems often face the challenge of managing dense, highly interconnected graphs. Storing all edges naively quickly becomes unsustainable. CBGC provides an efficient way to model entities by leveraging the structural properties of the data.

Not only does it reduce storage overhead, it also improves system performance, especially during data deletion and reprocessing. Although clique detection has its computational cost, careful engineering choices allow us to reap the benefits without sacrificing scalability.

liralbes 13 hours ago
